In [2]:
import os
In [3]:
import pandas as pd
In [4]:
import numpy as np
In [5]:
import matplotlib.pyplot as plt
In [6]:
import seaborn as sns

EDA- Analysis

In [7]:
cf=pd.read_excel("D:\\Jigsaw\\Project-student attrition\\Student2.xlsx")

shape is a tuple that gives you an indication of the number of dimensions in the array. Here It gives number of Rows and columns

In [8]:
cf.shape
Out[8]:
(3400, 56)

Pandas is a fantastic ecosystem of data-centric Python packages. Due to which is one of those packages and makes importing and analyzing data much easier.

head() method. Pandas head() method is used to return top n (5 by default) rows of a data frame or series.

In [9]:
cf.head()
Out[9]:
STUDENT IDENTIFIER STDNT_AGE STDNT_GENDER STDNT_BACKGROUND IN_STATE_FLAG INTERNATIONAL_STS STDNT_MAJOR STDNT_MINOR STDNT_TEST_ENTRANCE1 STDNT_TEST_ENTRANCE2 ... DEGREE_GROUP_CD DEGREE_GROUP_DESC FIRST_TERM_ATTEMPT_HRS FIRST_TERM_EARNED_HRS SECOND_TERM_ATTEMPT_HRS SECOND_TERM_EARNED_HRS GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION UNMET_NEED
0 7808615 18 F BGD 1 Y N Undeclared N NaN 1150.0 ... B Bachelors 16 16 14.0 14.0 0 0 0 0.0
1 7830063 19 F BGD 1 N N Undeclared N 26.0 NaN ... B Bachelors 18 18 18.0 18.0 570000 1355760 785760 459300.0
2 7847538 18 M BGD 1 Y N Mathematics N NaN 1020.0 ... B Bachelors 15 15 14.0 14.0 0 0 0 0.0
3 8006429 18 M BGD 1 Y N Undeclared N NaN 1210.0 ... B Bachelors 13 13 14.0 14.0 0 0 0 0.0
4 7962680 18 F BGD 1 Y N Art N NaN 1200.0 ... B Bachelors 12 12 12.0 12.0 835920 1355760 519840 278340.0

5 rows × 56 columns

The tail() function is used to get the last n rows. This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows. Number of rows to select by default it will show last 5 rows in python.

In [10]:
cf.tail()
Out[10]:
STUDENT IDENTIFIER STDNT_AGE STDNT_GENDER STDNT_BACKGROUND IN_STATE_FLAG INTERNATIONAL_STS STDNT_MAJOR STDNT_MINOR STDNT_TEST_ENTRANCE1 STDNT_TEST_ENTRANCE2 ... DEGREE_GROUP_CD DEGREE_GROUP_DESC FIRST_TERM_ATTEMPT_HRS FIRST_TERM_EARNED_HRS SECOND_TERM_ATTEMPT_HRS SECOND_TERM_EARNED_HRS GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION UNMET_NEED
3395 7948112 18 F BGD 4 N N Music Performance N NaN 800.0 ... B Bachelors 18 18 16.0 13.0 0 0 0 0.0
3396 8023055 18 F BGD 1 Y N Biology N 20.0 NaN ... B Bachelors 13 9 NaN NaN 0 0 0 0.0
3397 7926915 18 F BGD 4 Y N Biology N NaN 1020.0 ... B Bachelors 14 6 NaN NaN 0 945840 1288980 0.0
3398 7877332 18 F BGD 3 Y N Joint Enrollment - Accel N 21.0 NaN ... B Bachelors 12 9 NaN NaN 0 269100 0 -231720.0
3399 7928405 18 M BGD 2 Y N Art Creative Writing NaN 890.0 ... B Bachelors 12 9 13.0 6.0 0 0 0 0.0

5 rows × 56 columns

The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage. Inorder to print the full summary.

In [11]:
cf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3400 entries, 0 to 3399
Data columns (total 56 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   STUDENT IDENTIFIER        3400 non-null   int64  
 1   STDNT_AGE                 3400 non-null   int64  
 2   STDNT_GENDER              3400 non-null   object 
 3   STDNT_BACKGROUND          3400 non-null   object 
 4   IN_STATE_FLAG             3400 non-null   object 
 5   INTERNATIONAL_STS         3400 non-null   object 
 6   STDNT_MAJOR               3400 non-null   object 
 7   STDNT_MINOR               3400 non-null   object 
 8   STDNT_TEST_ENTRANCE1      1106 non-null   float64
 9   STDNT_TEST_ENTRANCE2      2492 non-null   float64
 10  STDNT_TEST_ENTRANCE_COMB  2882 non-null   float64
 11  FIRST_TERM                3400 non-null   int64  
 12  CORE_COURSE_NAME_1_F      3400 non-null   object 
 13  CORE_COURSE_GRADE_1_F     3400 non-null   object 
 14  CORE_COURSE_NAME_2_F      3301 non-null   object 
 15  CORE_COURSE_GRADE_2_F     3301 non-null   object 
 16  CORE_COURSE_NAME_3_F      2835 non-null   object 
 17  CORE_COURSE_GRADE_3_F     2835 non-null   object 
 18  CORE_COURSE_NAME_4_F      1803 non-null   object 
 19  CORE_COURSE_GRADE_4_F     1803 non-null   object 
 20  CORE_COURSE_NAME_5_F      645 non-null    object 
 21  CORE_COURSE_GRADE_5_F     645 non-null    object 
 22  CORE_COURSE_NAME_6_F      128 non-null    object 
 23  CORE_COURSE_GRADE_6_F     128 non-null    object 
 24  SECOND_TERM               3400 non-null   int64  
 25  CORE_COURSE_NAME_1_S      3243 non-null   object 
 26  CORE_COURSE_GRADE_1_S     3168 non-null   object 
 27  CORE_COURSE_NAME_2_S      2961 non-null   object 
 28  CORE_COURSE_GRADE_2_S     2961 non-null   object 
 29  CORE_COURSE_NAME_3_S      2362 non-null   object 
 30  CORE_COURSE_GRADE_3_S     2362 non-null   object 
 31  CORE_COURSE_NAME_4_S      1355 non-null   object 
 32  CORE_COURSE_GRADE_4_S     1355 non-null   object 
 33  CORE_COURSE_NAME_5_S      450 non-null    object 
 34  CORE_COURSE_GRADE_5_S     450 non-null    object 
 35  CORE_COURSE_NAME_6_S      81 non-null     object 
 36  CORE_COURSE_GRADE_6_S     81 non-null     object 
 37  HOUSING_STS               3400 non-null   object 
 38  RETURNED_2ND_YR           3400 non-null   int64  
 39  DISTANCE_FROM_HOME        3375 non-null   float64
 40  HIGH_SCHL_GPA             3347 non-null   float64
 41  HIGH_SCHL_NAME            3399 non-null   object 
 42  FATHER_HI_EDU_CD          2968 non-null   float64
 43  FATHER_HI_EDU_DESC        3400 non-null   object 
 44  MOTHER_HI_EDU_CD          2911 non-null   float64
 45  MOTHER_HI_EDU_DESC        3400 non-null   object 
 46  DEGREE_GROUP_CD           3400 non-null   object 
 47  DEGREE_GROUP_DESC         3400 non-null   object 
 48  FIRST_TERM_ATTEMPT_HRS    3400 non-null   int64  
 49  FIRST_TERM_EARNED_HRS     3400 non-null   int64  
 50  SECOND_TERM_ATTEMPT_HRS   3194 non-null   float64
 51  SECOND_TERM_EARNED_HRS    3191 non-null   float64
 52  GROSS_FIN_NEED            3400 non-null   int64  
 53  COST_OF_ATTEND            3400 non-null   int64  
 54  EST_FAM_CONTRIBUTION      3400 non-null   int64  
 55  UNMET_NEED                3400 non-null   float64
dtypes: float64(10), int64(10), object(36)
memory usage: 1.5+ MB
In [12]:
cf.columns
# It shows number of columns in entire data frame
Out[12]:
Index(['STUDENT IDENTIFIER', 'STDNT_AGE', 'STDNT_GENDER', 'STDNT_BACKGROUND',
       'IN_STATE_FLAG', 'INTERNATIONAL_STS', 'STDNT_MAJOR', 'STDNT_MINOR',
       'STDNT_TEST_ENTRANCE1', 'STDNT_TEST_ENTRANCE2',
       'STDNT_TEST_ENTRANCE_COMB', 'FIRST_TERM', 'CORE_COURSE_NAME_1_F',
       'CORE_COURSE_GRADE_1_F', 'CORE_COURSE_NAME_2_F',
       'CORE_COURSE_GRADE_2_F', 'CORE_COURSE_NAME_3_F',
       'CORE_COURSE_GRADE_3_F', 'CORE_COURSE_NAME_4_F',
       'CORE_COURSE_GRADE_4_F', 'CORE_COURSE_NAME_5_F',
       'CORE_COURSE_GRADE_5_F', 'CORE_COURSE_NAME_6_F',
       'CORE_COURSE_GRADE_6_F', 'SECOND_TERM', 'CORE_COURSE_NAME_1_S',
       'CORE_COURSE_GRADE_1_S', 'CORE_COURSE_NAME_2_S',
       'CORE_COURSE_GRADE_2_S', 'CORE_COURSE_NAME_3_S',
       'CORE_COURSE_GRADE_3_S', 'CORE_COURSE_NAME_4_S',
       'CORE_COURSE_GRADE_4_S', 'CORE_COURSE_NAME_5_S',
       'CORE_COURSE_GRADE_5_S', 'CORE_COURSE_NAME_6_S',
       'CORE_COURSE_GRADE_6_S', 'HOUSING_STS', 'RETURNED_2ND_YR',
       'DISTANCE_FROM_HOME', 'HIGH_SCHL_GPA', 'HIGH_SCHL_NAME',
       'FATHER_HI_EDU_CD', 'FATHER_HI_EDU_DESC', 'MOTHER_HI_EDU_CD',
       'MOTHER_HI_EDU_DESC', 'DEGREE_GROUP_CD', 'DEGREE_GROUP_DESC',
       'FIRST_TERM_ATTEMPT_HRS', 'FIRST_TERM_EARNED_HRS',
       'SECOND_TERM_ATTEMPT_HRS', 'SECOND_TERM_EARNED_HRS', 'GROSS_FIN_NEED',
       'COST_OF_ATTEND', 'EST_FAM_CONTRIBUTION', 'UNMET_NEED'],
      dtype='object')
In [13]:
cf.skew()
Out[13]:
STUDENT IDENTIFIER          -0.025792
STDNT_AGE                    4.416754
STDNT_TEST_ENTRANCE1         0.671944
STDNT_TEST_ENTRANCE2        -0.922564
STDNT_TEST_ENTRANCE_COMB    -0.364692
FIRST_TERM                  -0.170108
SECOND_TERM                 -0.170108
RETURNED_2ND_YR             -1.405151
DISTANCE_FROM_HOME          18.439510
HIGH_SCHL_GPA               -0.248503
FATHER_HI_EDU_CD             0.192835
MOTHER_HI_EDU_CD            -0.066167
FIRST_TERM_ATTEMPT_HRS       0.380062
FIRST_TERM_EARNED_HRS       -0.902359
SECOND_TERM_ATTEMPT_HRS     -0.178607
SECOND_TERM_EARNED_HRS      -0.891158
GROSS_FIN_NEED               1.235763
COST_OF_ATTEND               0.345186
EST_FAM_CONTRIBUTION         3.815042
UNMET_NEED                   0.665665
dtype: float64

This shows some columns are highly skewed & needs attention Highly skewed column includes, STDNT_AGE & DISTANCE_FROM_HOME As well SECOND_TERM_EARNED_HRS is highly negatively skewed

In [14]:
#check percentiles for some columns
quan=[]
for i in list(range(1,101)):
    quan.append(i/100)
print(cf['DISTANCE_FROM_HOME'].quantile(q=quan).tail())
print(cf['STDNT_AGE'].quantile(q=quan).tail())
0.96     208.0
0.97     235.0
0.98     275.0
0.99     666.0
1.00    5932.0
Name: DISTANCE_FROM_HOME, dtype: float64
0.96    19.0
0.97    19.0
0.98    19.0
0.99    19.0
1.00    26.0
Name: STDNT_AGE, dtype: float64
In [15]:
#Normalise distance fomhome column to treat skewness.Normalization converts all data points to decimals between 0 and 1. 
def normalize(column):
    upper = column.max()
    lower = column.min()
    y = (column - lower)/(upper-lower)
    return y
normalized = normalize(cf.DISTANCE_FROM_HOME)
normalized.describe()
Out[15]:
count    3375.000000
mean        0.016828
std         0.039649
min         0.000000
25%         0.011632
50%         0.011632
75%         0.023264
max         1.000000
Name: DISTANCE_FROM_HOME, dtype: float64
In [16]:
cf.shape
Out[16]:
(3400, 56)
In [17]:
#No of unique values per column 
for column in cf.columns:
    print(f"{column}: Number of unique values {cf[column].nunique()}")
    print("==========================================================")
STUDENT IDENTIFIER: Number of unique values 3400
==========================================================
STDNT_AGE: Number of unique values 11
==========================================================
STDNT_GENDER: Number of unique values 2
==========================================================
STDNT_BACKGROUND: Number of unique values 8
==========================================================
IN_STATE_FLAG: Number of unique values 2
==========================================================
INTERNATIONAL_STS: Number of unique values 2
==========================================================
STDNT_MAJOR: Number of unique values 54
==========================================================
STDNT_MINOR: Number of unique values 37
==========================================================
STDNT_TEST_ENTRANCE1: Number of unique values 23
==========================================================
STDNT_TEST_ENTRANCE2: Number of unique values 109
==========================================================
STDNT_TEST_ENTRANCE_COMB: Number of unique values 24
==========================================================
FIRST_TERM: Number of unique values 6
==========================================================
CORE_COURSE_NAME_1_F: Number of unique values 69
==========================================================
CORE_COURSE_GRADE_1_F: Number of unique values 7
==========================================================
CORE_COURSE_NAME_2_F: Number of unique values 74
==========================================================
CORE_COURSE_GRADE_2_F: Number of unique values 7
==========================================================
CORE_COURSE_NAME_3_F: Number of unique values 74
==========================================================
CORE_COURSE_GRADE_3_F: Number of unique values 6
==========================================================
CORE_COURSE_NAME_4_F: Number of unique values 68
==========================================================
CORE_COURSE_GRADE_4_F: Number of unique values 6
==========================================================
CORE_COURSE_NAME_5_F: Number of unique values 53
==========================================================
CORE_COURSE_GRADE_5_F: Number of unique values 4
==========================================================
CORE_COURSE_NAME_6_F: Number of unique values 24
==========================================================
CORE_COURSE_GRADE_6_F: Number of unique values 4
==========================================================
SECOND_TERM: Number of unique values 6
==========================================================
CORE_COURSE_NAME_1_S: Number of unique values 66
==========================================================
CORE_COURSE_GRADE_1_S: Number of unique values 8
==========================================================
CORE_COURSE_NAME_2_S: Number of unique values 69
==========================================================
CORE_COURSE_GRADE_2_S: Number of unique values 6
==========================================================
CORE_COURSE_NAME_3_S: Number of unique values 68
==========================================================
CORE_COURSE_GRADE_3_S: Number of unique values 6
==========================================================
CORE_COURSE_NAME_4_S: Number of unique values 64
==========================================================
CORE_COURSE_GRADE_4_S: Number of unique values 5
==========================================================
CORE_COURSE_NAME_5_S: Number of unique values 51
==========================================================
CORE_COURSE_GRADE_5_S: Number of unique values 5
==========================================================
CORE_COURSE_NAME_6_S: Number of unique values 26
==========================================================
CORE_COURSE_GRADE_6_S: Number of unique values 3
==========================================================
HOUSING_STS: Number of unique values 2
==========================================================
RETURNED_2ND_YR: Number of unique values 2
==========================================================
DISTANCE_FROM_HOME: Number of unique values 71
==========================================================
HIGH_SCHL_GPA: Number of unique values 200
==========================================================
HIGH_SCHL_NAME: Number of unique values 439
==========================================================
FATHER_HI_EDU_CD: Number of unique values 4
==========================================================
FATHER_HI_EDU_DESC: Number of unique values 5
==========================================================
MOTHER_HI_EDU_CD: Number of unique values 4
==========================================================
MOTHER_HI_EDU_DESC: Number of unique values 5
==========================================================
DEGREE_GROUP_CD: Number of unique values 3
==========================================================
DEGREE_GROUP_DESC: Number of unique values 3
==========================================================
FIRST_TERM_ATTEMPT_HRS: Number of unique values 12
==========================================================
FIRST_TERM_EARNED_HRS: Number of unique values 22
==========================================================
SECOND_TERM_ATTEMPT_HRS: Number of unique values 21
==========================================================
SECOND_TERM_EARNED_HRS: Number of unique values 23
==========================================================
GROSS_FIN_NEED: Number of unique values 927
==========================================================
COST_OF_ATTEND: Number of unique values 120
==========================================================
EST_FAM_CONTRIBUTION: Number of unique values 1236
==========================================================
UNMET_NEED: Number of unique values 1197
==========================================================

Categorical Features

In [18]:
demo1 = []
for column in cf.columns:
    if 1 < cf[column].nunique() <= 2:
   # if len(cf[column].unique()) == 2:
#    if raw1[column].dtype == object and len(raw1[column].unique()) <= 20:
        demo1.append(column)
        print(f"{column} : {cf[column].unique()}")
        print(cf[column].value_counts())
        print("====================================")
demo1.remove('RETURNED_2ND_YR')
STDNT_GENDER : ['F' 'M']
F    2140
M    1260
Name: STDNT_GENDER, dtype: int64
====================================
IN_STATE_FLAG : ['Y' 'N']
Y    3063
N     337
Name: IN_STATE_FLAG, dtype: int64
====================================
INTERNATIONAL_STS : ['N' 'Y']
N    3373
Y      27
Name: INTERNATIONAL_STS, dtype: int64
====================================
HOUSING_STS : ['On Campus' 'Off Campus']
Off Campus    1981
On Campus     1419
Name: HOUSING_STS, dtype: int64
====================================
RETURNED_2ND_YR : [0 1]
1    2677
0     723
Name: RETURNED_2ND_YR, dtype: int64
====================================
In [19]:
len(demo1)
Out[19]:
4
In [20]:
demo2 = []
for column in cf.columns:
    if 2 < cf[column].nunique() <= 20:
        print(f"{column} : {cf[column].unique()}")
        demo2.append(column)
        print("====================================")
STDNT_AGE : [18 19 17 20 16 21 23 25 22 26 24]
====================================
STDNT_BACKGROUND : ['BGD 1' 'BGD 3' 'BGD 6' 'BGD 4' 'BGD 8' 'BGD 2' 'BGD 7' 'BGD 5']
====================================
FIRST_TERM : [200708 200808 201008 200508 200608 200908]
====================================
CORE_COURSE_GRADE_1_F : ['A' 'B' 'C' 'NOT REP' 'D' 'F' 'INCOMPL']
====================================
CORE_COURSE_GRADE_2_F : ['A' 'B' 'C' 'D' 'F' nan 'NOT REP' 'INCOMPL']
====================================
CORE_COURSE_GRADE_3_F : ['A' nan 'B' 'C' 'D' 'F' 'INCOMPL']
====================================
CORE_COURSE_GRADE_4_F : ['A' nan 'B' 'C' 'D' 'F' 'INCOMPL']
====================================
CORE_COURSE_GRADE_5_F : [nan 'A' 'B' 'C' 'D']
====================================
CORE_COURSE_GRADE_6_F : [nan 'A' 'B' 'C' 'D']
====================================
SECOND_TERM : [200802 200902 201102 200602 200702 201002]
====================================
CORE_COURSE_GRADE_1_S : ['A' 'B' 'C' 'D' 'F' 'NOT REP' 'INCOMPL' 'Unknown' nan]
====================================
CORE_COURSE_GRADE_2_S : ['A' 'B' 'C' 'D' 'F' 'INCOMPL' nan]
====================================
CORE_COURSE_GRADE_3_S : ['A' nan 'B' 'C' 'D' 'F' 'INCOMPL']
====================================
CORE_COURSE_GRADE_4_S : ['A' nan 'B' 'C' 'D' 'F']
====================================
CORE_COURSE_GRADE_5_S : [nan 'A' 'B' 'C' 'D' 'F']
====================================
CORE_COURSE_GRADE_6_S : [nan 'A' 'B' 'C']
====================================
FATHER_HI_EDU_CD : [nan  3.  2.  1.  4.]
====================================
FATHER_HI_EDU_DESC : ['No Information' 'College/Beyond' 'High School'
 'Middle School/Junior High' 'Other/Unknown']
====================================
MOTHER_HI_EDU_CD : [nan  3.  4.  2.  1.]
====================================
MOTHER_HI_EDU_DESC : ['No Information' 'College/Beyond' 'Other/Unknown' 'High School'
 'Middle School/Junior High']
====================================
DEGREE_GROUP_CD : ['B' 'A' 'V']
====================================
DEGREE_GROUP_DESC : ['Bachelors' 'Associate' 'Career Associate']
====================================
FIRST_TERM_ATTEMPT_HRS : [16 18 15 13 12 17 14 19 20 21  9 11]
====================================
In [21]:
len(demo2)
Out[21]:
23
In [22]:
demo3 = []
for column in cf.columns:
    if 20 < cf[column].nunique() <= 100:
        print(f"{column} : {cf[column].unique()}")
        demo3.append(column)
        print("====================================")
STDNT_MAJOR : ['Undeclared' 'Mathematics' 'Art' 'Political Science' 'Biology'
 'Accounting' 'Criminal Justice' 'Psychology' 'Theatre Arts' 'Nursing'
 'Pre-Nursing' 'Chemistry' 'Pre-Engineering/RETP'
 'Early Childhood Education' 'Music Performance'
 'English Language/Literature' 'Joint Enrollment - Accel'
 'History and Secondary Ed' 'History' 'Engineering Studies'
 'Computer Science - Systems' 'Pre-Business' 'Management'
 'Theatre Education' 'Marketing' 'Communication' 'Music Education'
 'French with Teacher Cert' 'English and Secondary Ed' 'Exercise Science'
 'General Business' 'Mathematics and Secondary Ed' 'Finance' 'French'
 'Health Science' 'Spec Ed: Gen. Curr. - Reading' 'Music'
 'Middle Grades Education' 'Geology' 'Management Information Systems'
 'Art Education' 'Applied Computer Science' 'Information Technology'
 'Sociology' 'Computer Science - Games' 'Spanish' 'General Studies/AS'
 'Biology and Secondary Ed' 'Chemistry and Secondary Ed'
 'Health and Physical Education' 'Liberal Arts'
 'Spanish with Teacher Cert' 'Earth and Space Science'
 'Early Admission - Accel']
====================================
STDNT_MINOR : ['N' 'Theatre Arts' 'Foundations of Business' 'Spanish'
 'Professional Writing' 'Computer Science - Systems' 'Art' 'Mathematics'
 'Chemistry' 'Psychology' "Women's Studies" 'Creative Writing' 'Biology'
 'History' 'Early Childhood Education' 'Health Science' 'Accounting'
 'Political Science' 'Communication' 'Asian Studies' 'Criminal Justice'
 'English Language/Literature' 'Art History' 'Marketing' 'French'
 'Economics' 'Sociology' 'International Business' 'Music'
 'Exercise Science' 'Finance' 'Computer Info. Management'
 'Military Sci & Adv Leadership' 'African Studies' 'Management'
 'Anthropology' 'Philosophy']
====================================
STDNT_TEST_ENTRANCE1 : [nan 26. 27. 28. 23. 21. 19. 31. 25. 20. 14. 22. 30. 24. 18. 32. 29. 16.
 15. 17.  8. 11. 13. 12.]
====================================
STDNT_TEST_ENTRANCE_COMB : [1150. 1190. 1030. 1220. 1070.  590.  740.   nan 1260. 1110.  990.  910.
 1300.  640.  530. 1340. 1380.  950.  690.  870.  790. 1420.  830. 1510.
 1460.]
====================================
CORE_COURSE_NAME_1_F : ['ANTH 1105' 'ANTH 1107' 'ARAB 1001' 'ARTH 1100' 'BIOL 1125' 'BIOL 1215K'
 'CHEM 1151' 'CHEM 1211' 'COMM 1110' 'COMM 1110H' 'CPSC 1105' 'CPSC 1301'
 'ECON 2105' 'ECON 2106' 'ENGL 1101' 'ENGL 1102' 'FREN 1001' 'HIST 1111'
 'HIST 2111' 'HIST 2112' 'LEAD 1705' 'MATH 1101' 'MUSC 1100' 'CHEM 1151L'
 'ENVS 1105' 'GEOL 1121L' 'HIST 1112' 'HIST 1112I' 'ITDS 1155' 'MATH 1111'
 'POLS 1101' 'THEA 1100' 'MATH 1131' 'GEOG 1101I' 'ASTR 1105' 'PSYC 1101'
 'ENGL 1101I' 'MATH 1132' 'MATH 1113' 'GEOL 1121' 'ARTH 1100I' 'ITDS 1145'
 'ENVS 1105I' 'ITDS 2735' 'POLS 1101H' 'SOCI 1101' 'SPAN 1001'
 'CHEM 1211L' 'GEOL 1110' 'STAT 1127' 'MATH 1125' 'CPSC 1301L' 'PHYS 1125'
 'ANTH 1105I' 'THEA 1100I' 'COMM 1110I' 'ITDS 2748' 'SPAN 1002'
 'PHYS 2311' 'MATH 2125' 'FREN 1002' 'GERM 1002' 'ITDS 2749' 'ASTR 1305'
 'ITDS 2726' 'POLS 2401' 'ITDS 2727' 'ITDS 1146' 'ENGL 2111']
====================================
CORE_COURSE_NAME_2_F : ['ARTH 1100' 'GERM 1002' 'ENGL 1101' 'COMM 1110' 'HIST 2112' 'CHEM 1211'
 'ENGL 1102' 'ECON 2105' 'ITDS 2735' 'CHEM 1151L' 'CHEM 1211L' 'MATH 1111'
 'FREN 1002' 'ENGL 2111' 'ENVS 1105' 'HIST 1111' 'HIST 2111' 'CHEM 1151'
 'BIOL 1215K' 'CPSC 1301L' 'MATH 1113' 'SOCI 1101' 'SPAN 1002' 'MATH 1131'
 'MUSC 1100' 'LEAD 1705' 'HIST 1112' 'PSYC 1101' 'THEA 1100' 'POLS 1101'
 'CPSC 1105' 'CPSC 1301' 'ITDS 1145' 'FREN 2001' 'GEOL 1121' 'ARTH 1100I'
 'COMM 1110I' 'ECON 2106' 'ANTH 1105' 'COMM 1110H' 'ITDS 1155'
 'POLS 1101H' 'MATH 1125' 'HIST 1112I' 'SPAN 1001' 'BIOL 1125' 'ASTR 1105'
 'PHYS 2311' 'GEOG 1101I' nan 'ENGL 1101I' 'FREN 1001' 'STAT 1127'
 'MATH 2115' 'ITDS 1156' 'POLS 2401' 'MATH 1101' 'GEOL 1121L' 'GEOL 1110'
 'PHYS 1125' 'JAPN 1001' 'ENGL 2136' 'ITDS 2726' 'ITDS 2749' 'ANTH 1105I'
 'ITDS 1146' 'MATH 1132' 'ANTH 1107' 'ENVS 1105I' 'PHYS 1325' 'ITDS 2748'
 'THEA 1100I' 'ASTR 1305' 'ITDS 2727' 'MATH 2125']
====================================
CORE_COURSE_NAME_3_F : ['ENGL 1102' 'ITDS 2735' 'HIST 2111' nan 'CPSC 1105' 'MATH 1101'
 'CHEM 1211L' 'ENGL 1101' 'HIST 1111' 'MATH 1132' 'POLS 2401' 'MATH 1113'
 'STAT 1127' 'MATH 1111' 'ECON 2105' 'COMM 1110' 'FREN 1001' 'HIST 2112'
 'POLS 1101H' 'LEAD 1705' 'MUSC 1100' 'PSYC 1101' 'ITDS 2749' 'THEA 1100'
 'HIST 1112' 'ITDS 2726H' 'CHEM 1151L' 'ITDS 1155' 'POLS 1101' 'SOCI 1101'
 'SPAN 1002' 'MATH 1125' 'CPSC 1301L' 'ITDS 1145' 'ENGL 1101I' 'ITDS 2726'
 'CHEM 1211' 'CHEM 1151' 'BIOL 1215K' 'ARTH 1100' 'MATH 1131' 'JAPN 1001'
 'ENVS 1105I' 'SPAN 1001' 'ENVS 1105' 'HIST 1112I' 'COMM 1110H'
 'GEOL 1110' 'THEA 1100I' 'ECON 2106' 'PHYS 1125' 'MATH 2125' 'ANTH 1105'
 'CPSC 1301' 'FREN 1002' 'GEOG 1101I' 'ITDS 2727' 'ENGL 2111' 'ASTR 1105'
 'BIOL 1225K' 'PHYS 1325' 'ASTR 1305' 'ITDS 2748' 'ITDS 2746' 'GEOL 1121L'
 'BIOL 1125' 'MATH 2115' 'ARTH 1100I' 'ENGL 2136' 'COMM 1110I' 'ITDS 1146'
 'GEOL 1121' 'ANTH 1107' 'PHYS 2211' 'LATN 1002']
====================================
CORE_COURSE_NAME_4_F : ['PSYC 1101' 'MATH 1125' 'MATH 1111' nan 'MUSC 1100' 'ENGL 1101'
 'POLS 1101' 'MATH 1101' 'ECON 2105' 'HIST 1111' 'HIST 2111' 'LEAD 1705'
 'HIST 2112' 'MATH 1113' 'THEA 1100' 'ITDS 2749' 'SOCI 1101' 'MATH 2115'
 'ENGL 1102' 'ITDS 2735' 'SPAN 1002' 'SPAN 1001' 'PHYS 1125' 'ITDS 2748'
 'MATH 1131' 'ECON 2106' 'CHEM 1211L' 'CHEM 1151L' 'CHEM 1151' 'COMM 1110'
 'ITDS 2726H' 'ARTH 1100I' 'PHYS 2211' 'POLS 2401' 'GEOG 1101I'
 'STAT 1127' 'ITDS 1155' 'PHYS 1325' 'FREN 2001' 'FREN 1001' 'CHEM 1211'
 'CPSC 1301L' 'CPSC 1301' 'BIOL 1215K' 'JAPN 1001' 'ARTH 1100' 'HIST 1112'
 'COMM 1110H' 'ENGL 1101I' 'CPSC 1105' 'THEA 1100I' 'ITDS 2726'
 'ANTH 1105' 'ITDS 2746' 'HIST 1112I' 'ITDS 2727' 'GEOL 1121L' 'ENGL 2136'
 'ENVS 1105I' 'ASTR 1305' 'MATH 1132' 'ITDS 1146' 'GEOL 1110' 'COMM 1110I'
 'BIOL 1125' 'ENVS 1105' 'PHYS 2311' 'GERM 1002' 'POLS 1101H']
====================================
CORE_COURSE_NAME_5_F : [nan 'THEA 1100' 'SOCI 1101' 'PSYC 1101' 'HIST 2111' 'POLS 1101'
 'ENGL 1101' 'MATH 1101' 'MUSC 1100' 'MATH 1111' 'STAT 1127' 'POLS 1101H'
 'JAPN 1001' 'SPAN 1001' 'POLS 2401' 'ITDS 2749' 'CPSC 1105' 'ITDS 2748'
 'SPAN 1002' 'MATH 1131' 'ITDS 1156' 'Life and Career Planning'
 'ITDS 2735' 'COMM 1110' 'CHEM 1211L' 'MATH 1113' 'PHYS 1325' 'HIST 1112I'
 'HIST 2112' 'ITDS 1145' 'ENGL 1102' 'LEAD 1705' 'ECON 2105' 'CHEM 1151L'
 'CHEM 1211' 'ARTH 1100' 'ITDS 2726' 'CPSC 1301L' 'ECON 2106' 'FREN 1001'
 'MATH 1125' 'BIOL 1215K' 'ENVS 1105I' 'GEOL 1110' 'FREN 1002' 'GEOL 1121'
 'ITDS 1146' 'MATH 2115' 'COMM 1110I' 'HIST 1111' 'HIST 1112' 'CPSC 1301'
 'ITDS 2746' 'CHEM 1151']
====================================
CORE_COURSE_NAME_6_F : [nan 'POLS 1101' 'MATH 1113' 'MATH 1111' 'LEAD 1705' 'SOCI 1101'
 'MUSC 1100' 'U.S. History to 1865' 'MATH 1101' 'ENGL 1101' 'ITDS 2749'
 'CHEM 1151L' 'PSYC 1101' 'COMM 1110' 'HIST 1112' 'THEA 1100' 'ITDS 2735'
 'HIST 2112' 'SPAN 1001' 'CHEM 1211L' 'HIST 2111' 'CPSC 1301L' 'MATH 1131'
 'FREN 2002' 'ITDS 2748']
====================================
CORE_COURSE_NAME_1_S : ['CHEM 1151' 'ECON 2105' 'ENGL 1102' 'COMM 1110' 'BIOL 1215K' 'ECON 2106'
 'SPAN 1001' 'MATH 1101' 'ARTH 1100' 'BIOL 1225K' 'HIST 2112' 'COMM 1110H'
 'CPSC 1105' 'HIST 2111' 'ITDS 2726' 'MATH 1111' 'GEOG 1101I' 'ANTH 1107'
 'ENGL 1101' 'ASTR 1105' 'COMM 1110I' 'ANTH 1105I' 'BIOL 1125' 'CPSC 1301'
 'ANTH 1105' 'STAT 1127' 'CHEM 1211' 'MUSC 1100' 'FREN 1001' 'SPAN 1002'
 'ENGL 2111' 'GEOL 1121' 'ITDS 1145' 'ENVS 1105' 'GEOL 1121L' 'POLS 1101'
 'PSYC 1101' 'THEA 1100' 'MATH 1131' 'ENGL 2136' 'MATH 1113' 'FREN 1002'
 'CHEM 1151L' 'HIST 1111' 'ITDS 1156' 'GEOL 1110' 'ITDS 2735' 'MATH 1132'
 'POLS 2401' 'PHYS 1125' 'SOCI 1101' 'CHEM 1211L' 'MATH 2125' 'FREN 2001'
 'HIST 1112' 'HIST 1112I' 'ITDS 1155' 'FREN 2002' 'ENGL 1101I' 'MATH 1125'
 'MATH 2115' 'CPSC 1301L' 'ITDS 2749' 'ITDS 2727' 'POLS 1101H' 'ITDS 2748'
 nan]
====================================
CORE_COURSE_NAME_2_S : ['CHEM 1151L' 'ENGL 1102' 'ENVS 1105' 'HIST 2112' 'ENGL 1101' 'ITDS 2735'
 'HIST 2111' 'STAT 1127' 'THEA 1100' 'CHEM 1211' 'MUSC 1100' 'PSYC 1101'
 'COMM 1110' 'FREN 1002' 'MATH 1131' 'COMM 1110H' 'FREN 2001' 'MATH 1111'
 'SPAN 1001' 'POLS 1101' 'ITDS 2727' 'ECON 2105' 'POLS 2401' 'MATH 1132'
 'MATH 1125' 'MATH 1101' 'GEOG 1101I' 'MATH 1113' 'ITDS 2748' 'ITDS 1145'
 'PHYS 1125' 'CHEM 1151' 'BIOL 1215K' 'HIST 1112I' 'GEOL 1110'
 'CPSC 1301L' 'ENGL 2111' 'ITDS 1156' 'ECON 2106' 'ITDS 1146' 'HIST 1112'
 'CHEM 1211L' 'SOCI 1101' 'CPSC 1105' 'SPAN 1002' 'LEAD 1705' 'ENGL 2136'
 'GEOL 1121L' 'ITDS 2726' 'ARTH 1100' 'FREN 1001' 'COMM 1110I' 'BIOL 1125'
 'ANTH 1105' 'ANTH 1105I' 'CPSC 1301' 'ASTR 1105' 'ITDS 1155' 'ITDS 2749'
 'HIST 1111' 'GEOL 1121' 'ANTH 1107' 'POLS 1101H' 'ENGL 1101I' 'GEOL 2225'
 'MATH 2115' 'BIOL 1225K' 'MATH 2125' 'LATN 1002' nan]
====================================
CORE_COURSE_NAME_3_S : ['COMM 1110' nan 'MUSC 1100' 'LEAD 1705' 'HIST 2111' 'ITDS 1156'
 'MATH 1111' 'HIST 1112' 'POLS 1101H' 'CHEM 1211L' 'ENGL 1102' 'PSYC 1101'
 'POLS 1101' 'ENVS 1105' 'MATH 1113' 'ITDS 1145' 'ITDS 2735' 'HIST 2112'
 'MATH 1125' 'ITDS 2726' 'THEA 1100' 'MATH 1101' 'SPAN 1001' 'SOCI 1101'
 'MATH 1131' 'ENGL 1101' 'ENGL 2136' 'MATH 1132' 'GEOL 1110' 'CHEM 1151L'
 'ITDS 2727' 'GEOL 1121' 'ASTR 1105' 'FREN 2001' 'STAT 1127' 'PHYS 1125'
 'ECON 2105' 'PHYS 1325' 'ITDS 2749' 'ECON 2106' 'SPAN 1002' 'HIST 1111'
 'BIOL 1215K' 'ITDS 2746' 'FREN 1001' 'ITDS 1155' 'ITDS 2748' 'FREN 1002'
 'CHEM 1151' 'FREN 2002' 'POLS 2401' 'GEOL 1121L' 'GEOG 1101I' 'ARTH 1100'
 'CPSC 1105' 'ANTH 1105I' 'CHEM 1211' 'CPSC 1301' 'ANTH 1107' 'BIOL 1125'
 'ITDS 1146' 'MATH 2125' 'HIST 1112I' 'ENGL 2111' 'ANTH 1105' 'ENGL 1101I'
 'CPSC 1301L' 'MATH 2115' 'COMM 1110I']
====================================
CORE_COURSE_NAME_4_S : ['ITDS 2749' nan 'MUSC 1100' 'PSYC 1101' 'POLS 1101' 'MATH 1125'
 'SPAN 1001' 'HIST 1111' 'ENGL 1102' 'HIST 2112' 'POLS 2401' 'MATH 1101'
 'ITDS 2735' 'HIST 2111' 'MATH 1111' 'HIST 1112' 'STAT 1127' 'ITDS 1145'
 'GEOL 1121L' 'COMM 1110' 'THEA 1100' 'LEAD 1705' 'ITDS 1155' 'MATH 1113'
 'MATH 1132' 'ENGL 1101' 'ECON 2106' 'MATH 1131' 'SOCI 1101' 'ITDS 2726'
 'ITDS 2746' 'PHYS 1125' 'ITDS 2748' 'FREN 2001' 'SPAN 1002' 'ITDS 1146'
 'CHEM 1211L' 'CPSC 1301L' 'FREN 1002' 'CHEM 1211' 'ARTH 1100' 'CPSC 1105'
 'FREN 1001' 'GEOG 1101I' 'CHEM 1151' 'BIOL 1215K' 'CHEM 1151L'
 'ECON 2105' 'GEOL 1110' 'COMM 1110H' 'ASTR 1105' 'COMM 1110I' 'ITDS 2727'
 'MATH 2125' 'PHYS 1325' 'ANTH 1105I' 'ENVS 1105' 'ENGL 2136' 'BIOL 1225K'
 'ANTH 1105' 'GEOL 1121' 'CPSC 1301' 'ITDS 1156' 'HIST 1112I' 'ENGL 1101I']
====================================
CORE_COURSE_NAME_5_S : [nan 'MATH 1113' 'MATH 1111' 'ITDS 2749' 'POLS 1101' 'MUSC 1100'
 'PSYC 1101' 'ITDS 2748' 'SPAN 1002' 'FREN 1001' 'ENGL 2136' 'MATH 1101'
 'THEA 1100' 'SPAN 1001' 'ITDS 2746' 'MATH 2125' 'ITDS 1145' 'STAT 1127'
 'ENGL 1102' 'ITDS 2735' 'ITDS 1146' 'SOCI 1101' 'MATH 1132' 'ITDS 2726'
 'CPSC 1301L' 'ITDS 2727' 'COMM 1110' 'HIST 2112' 'CHEM 1211L'
 'CHEM 1151L' 'ENVS 1105' 'MATH 1125' 'HIST 2111' 'CHEM 1151' 'CPSC 1105'
 'GEOL 1121L' 'ECON 2105' 'HIST 1112' 'ITDS 1156' 'MATH 1131' 'ECON 2106'
 'ENGL 1101' 'ITDS 1155' 'PHYS 1125' 'LEAD 1705' 'GEOL 1110' 'POLS 2401'
 'FREN 1002' 'ARTH 1100' 'HIST 1111' 'CPSC 1301' 'BIOL 1215K']
====================================
CORE_COURSE_NAME_6_S : [nan 'PSYC 1101' 'POLS 1101' 'HIST 2112' 'STAT 1127' 'SPAN 1001'
 'MUSC 1100' 'ENGL 1101' 'SPAN 1002' 'MATH 1101' 'ENGL 1102' 'CHEM 1151L'
 'ITDS 2746' 'ITDS 2735' 'LEAD 1705' 'SOCI 1101' 'MATH 1111' 'ITDS 2727'
 'MATH 1125' 'ITDS 2749' 'THEA 1100' 'COMM 1110' 'HIST 2111' 'CPSC 1105'
 'ITDS 2748' 'CPSC 1301L' 'CHEM 1151']
====================================
DISTANCE_FROM_HOME : [ 150.   69.    0. 4501.  661.  138.  283.   90.  188.  614.   nan   59.
  997.  181.  666.  190.  176.  222.  208.  778.  235.  513.  215.   91.
  310.  246.  722.  137.  572. 1854.  117.  136. 5932.  237.  441.  274.
  270.  301.  703.  398.  275.  225.  355.  299.  871.  711.  151. 1270.
 2221.  528.  694.  316.  477. 1735.  574.  421. 1078.  769. 1222.  330.
  382.  517.  403.  609. 1161.  556. 1001.  919. 2203.  366.  786. 1216.]
====================================
FIRST_TERM_EARNED_HRS : [16 18 15 13 12 17 14 10  9 11  8  7  6 19 21  0  2  4  3  5 20  1]
====================================
SECOND_TERM_ATTEMPT_HRS : [14. 18. 12. 16. 15. 13. 17.  7. 19.  9.  6. 21. 20. 11. 23. 10. 22.  8.
  4. nan  3.  2.]
====================================
SECOND_TERM_EARNED_HRS : [14. 18. 12. 16. 15. 13. 17.  7. 19.  8.  9. 10. 11.  6. 21. 20. 23. nan
  3.  5.  4.  0.  2.  1.]
====================================
In [23]:
len(demo3)
Out[23]:
20
In [24]:
demo4 = []
for column in cf.columns:
    if cf[column].nunique() > 100:
        print(f"{column} : {cf[column].unique()}")
        demo4.append(column)
        print("====================================")
STUDENT IDENTIFIER : [7808615 7830063 7847538 ... 7926915 7877332 7928405]
====================================
STDNT_TEST_ENTRANCE2 : [1150.   nan 1020. 1210. 1200. 1050.  580.  750. 1060. 1250. 1230. 1220.
 1040. 1110.  930. 1310. 1300.  610. 1120.  650.  550.  620.  600.  590.
 1130. 1350. 1080.  510. 1330. 1170. 1140. 1180. 1270. 1160. 1010. 1000.
 1360. 1190.  640.  500.  740. 1260. 1100. 1390. 1070.  690.  940.  560.
  980.  460.  920.  950.  410.  990. 1290. 1280. 1240. 1030.  900. 1090.
  570.  480.  780.  890. 1370.  970.  450. 1320.  960. 1420. 1400.  520.
  530.  630. 1340.  490.  880.  840.  540.  910.  830.  860.  470.  820.
  870.  760.  850.  810.  400.  800. 1430.  770.  660.  420.  700.  430.
  390. 1380.  370.  440. 1490.  790.  730.  670.  720.  360.  350. 1440.
 1410.  710.]
====================================
HIGH_SCHL_GPA : [4.     2.89   3.39   2.93   3.86   3.93   3.69   3.34   3.96   3.35
    nan 3.5    2.4    3.71   3.72   3.73   3.12   3.57   3.81   3.9
 3.92   3.43   3.6    2.97   3.58   3.79   3.54   3.46   3.67   2.9
 3.63   3.8    3.75   3.56   3.53   3.78   3.33   3.     3.61   3.3
 3.21   3.37   2.66   3.82   3.62   2.88   3.87   3.88   3.26   3.7
 3.74   2.8    2.63   3.47   3.52   3.68   3.25   3.28   3.06   3.66
 3.64   3.4375 3.85   3.83   3.07   2.96   3.36   3.16   3.15   2.76
 3.31   3.77   3.13   3.4    3.18   3.27   3.84   3.19   2.92   3.94
 2.73   3.14   2.81   2.64   3.08   3.09   3.65   3.38   2.51   3.1
 3.97   3.2    2.6    3.17   3.23   2.84   2.87   3.42   3.03   2.77
 3.55   2.72   2.79   2.94   3.76   2.58   3.04   2.61   2.85   2.86
 2.71   3.32   2.82   3.95   2.21   2.33   2.78   2.62   2.54   3.29
 2.83   2.75   2.57   2.31   2.5    2.38   3.11   2.26   2.74   2.45
 3.59   2.44   2.37   2.7    3.48   2.43   2.1    3.22   2.42   2.46
 2.25   2.69   3.875  2.923  2.     3.44   3.41   2.04   2.65   3.625
 2.67   2.68   2.47   3.24   2.36   2.59   2.656  2.29   2.53   2.32
 2.41   2.875  2.56   3.9375 2.91   2.07   2.55   2.52   2.3    2.375
 2.35   2.2    3.45   2.13   2.17   3.01   2.15   2.687  2.95   2.18
 3.51   3.688  2.27   2.625  2.28   1.92   2.23   2.48   2.06   2.39
 2.24   0.     2.34   2.813  3.148  2.16   3.89   2.08   1.87   3.6875
 1.94  ]
====================================
HIGH_SCHL_NAME : ['SCHOOL 50' 'SCHOOL 389' 'SCHOOL 82' 'SCHOOL 180' 'SCHOOL 2' 'SCHOOL 5'
 'SCHOOL 130' 'SCHOOL 54' 'SCHOOL 189' 'SCHOOL 342' 'SCHOOL 11'
 'SCHOOL 28' 'SCHOOL 19' 'SCHOOL 98' 'SCHOOL 14' 'SCHOOL 121' 'SCHOOL 15'
 'SCHOOL 283' 'SCHOOL 304' 'SCHOOL 61' 'SCHOOL 229' 'SCHOOL 148'
 'SCHOOL 298' 'SCHOOL 175' 'SCHOOL 26' 'SCHOOL 88' 'SCHOOL 131'
 'SCHOOL 33' 'SCHOOL 128' 'SCHOOL 8' 'SCHOOL 223' 'SCHOOL 27' 'SCHOOL 52'
 'SCHOOL 29' 'SCHOOL 70' 'SCHOOL 20' 'SCHOOL 184' 'SCHOOL 453' 'SCHOOL 72'
 'SCHOOL 7' 'SCHOOL 25' 'SCHOOL 1' 'SCHOOL 18' 'SCHOOL 9' 'SCHOOL 216'
 'SCHOOL 10' 'SCHOOL 311' 'SCHOOL 34' 'SCHOOL 43' 'SCHOOL 45' 'SCHOOL 74'
 'SCHOOL 423' 'SCHOOL 132' 'SCHOOL 46' 'SCHOOL 47' 'SCHOOL 200'
 'SCHOOL 83' 'SCHOOL 292' 'SCHOOL 129' 'SCHOOL 66' 'SCHOOL 12'
 'SCHOOL 319' 'SCHOOL 3' 'SCHOOL 85' 'SCHOOL 49' 'SCHOOL 339' 'SCHOOL 23'
 'SCHOOL 111' 'SCHOOL 366' 'SCHOOL 440' 'SCHOOL 141' 'SCHOOL 13'
 'SCHOOL 48' 'SCHOOL 91' 'SCHOOL 22' 'SCHOOL 236' 'SCHOOL 263' 'SCHOOL 37'
 'SCHOOL 95' 'SCHOOL 55' 'SCHOOL 125' 'SCHOOL 196' 'SCHOOL 277'
 'SCHOOL 367' 'SCHOOL 317' 'SCHOOL 159' 'SCHOOL 38' 'SCHOOL 4' 'SCHOOL 87'
 'SCHOOL 78' 'SCHOOL 16' 'SCHOOL 336' 'SCHOOL 149' 'SCHOOL 155'
 'SCHOOL 177' 'SCHOOL 523' 'SCHOOL 285' 'SCHOOL 171' 'SCHOOL 65'
 'SCHOOL 262' 'SCHOOL 410' 'SCHOOL 30' 'SCHOOL 17' 'SCHOOL 230'
 'SCHOOL 57' 'SCHOOL 42' 'SCHOOL 284' 'SCHOOL 114' 'SCHOOL 386' nan
 'SCHOOL 53' 'SCHOOL 419' 'SCHOOL 226' 'SCHOOL 340' 'SCHOOL 288'
 'SCHOOL 154' 'SCHOOL 274' 'SCHOOL 6' 'SCHOOL 123' 'SCHOOL 384'
 'SCHOOL 515' 'SCHOOL 163' 'SCHOOL 505' 'SCHOOL 327' 'SCHOOL 24'
 'SCHOOL 295' 'SCHOOL 434' 'SCHOOL 416' 'SCHOOL 502' 'SCHOOL 477'
 'SCHOOL 93' 'SCHOOL 158' 'SCHOOL 312' 'SCHOOL 412' 'SCHOOL 219'
 'SCHOOL 310' 'SCHOOL 79' 'SCHOOL 73' 'SCHOOL 272' 'SCHOOL 120'
 'SCHOOL 153' 'SCHOOL 486' 'SCHOOL 315' 'SCHOOL 135' 'SCHOOL 100'
 'SCHOOL 234' 'SCHOOL 273' 'SCHOOL 137' 'SCHOOL 179' 'SCHOOL 428'
 'SCHOOL 452' 'SCHOOL 44' 'SCHOOL 152' 'SCHOOL 115' 'SCHOOL 139'
 'SCHOOL 398' 'SCHOOL 99' 'SCHOOL 140' 'SCHOOL 369' 'SCHOOL 157'
 'SCHOOL 333' 'SCHOOL 183' 'SCHOOL 127' 'SCHOOL 150' 'SCHOOL 126'
 'SCHOOL 258' 'SCHOOL 268' 'SCHOOL 278' 'SCHOOL 259' 'SCHOOL 41'
 'SCHOOL 143' 'SCHOOL 314' 'SCHOOL 489' 'SCHOOL 119' 'SCHOOL 465'
 'SCHOOL 395' 'SCHOOL 31' 'SCHOOL 276' 'SCHOOL 84' 'SCHOOL 108'
 'SCHOOL 69' 'SCHOOL 62' 'SCHOOL 356' 'SCHOOL 232' 'SCHOOL 467'
 'SCHOOL 521' 'SCHOOL 35' 'SCHOOL 397' 'SCHOOL 107' 'SCHOOL 151'
 'SCHOOL 334' 'SCHOOL 162' 'SCHOOL 270' 'SCHOOL 293' 'SCHOOL 144'
 'SCHOOL 71' 'SCHOOL 192' 'SCHOOL 133' 'SCHOOL 308' 'SCHOOL 378'
 'SCHOOL 245' 'SCHOOL 519' 'SCHOOL 483' 'SCHOOL 104' 'SCHOOL 89'
 'SCHOOL 279' 'SCHOOL 321' 'SCHOOL 482' 'SCHOOL 185' 'SCHOOL 40'
 'SCHOOL 271' 'SCHOOL 424' 'SCHOOL 136' 'SCHOOL 501' 'SCHOOL 197'
 'SCHOOL 240' 'SCHOOL 326' 'SCHOOL 267' 'SCHOOL 204' 'SCHOOL 146'
 'SCHOOL 260' 'SCHOOL 318' 'SCHOOL 286' 'SCHOOL 51' 'SCHOOL 316'
 'SCHOOL 431' 'SCHOOL 244' 'SCHOOL 337' 'SCHOOL 36' 'SCHOOL 181'
 'SCHOOL 459' 'SCHOOL 466' 'SCHOOL 256' 'SCHOOL 422' 'SCHOOL 92'
 'SCHOOL 238' 'SCHOOL 350' 'SCHOOL 508' 'SCHOOL 287' 'SCHOOL 76'
 'SCHOOL 289' 'SCHOOL 178' 'SCHOOL 469' 'SCHOOL 325' 'SCHOOL 86'
 'SCHOOL 457' 'SCHOOL 294' 'SCHOOL 492' 'SCHOOL 21' 'SCHOOL 518'
 'SCHOOL 341' 'SCHOOL 464' 'SCHOOL 429' 'SCHOOL 381' 'SCHOOL 399'
 'SCHOOL 522' 'SCHOOL 32' 'SCHOOL 198' 'SCHOOL 235' 'SCHOOL 455'
 'SCHOOL 280' 'SCHOOL 487' 'SCHOOL 463' 'SCHOOL 203' 'SCHOOL 253'
 'SCHOOL 338' 'SCHOOL 186' 'SCHOOL 361' 'SCHOOL 67' 'SCHOOL 81'
 'SCHOOL 403' 'SCHOOL 39' 'SCHOOL 182' 'SCHOOL 112' 'SCHOOL 56'
 'SCHOOL 207' 'SCHOOL 313' 'SCHOOL 227' 'SCHOOL 249' 'SCHOOL 134'
 'SCHOOL 371' 'SCHOOL 63' 'SCHOOL 370' 'SCHOOL 520' 'SCHOOL 257'
 'SCHOOL 239' 'SCHOOL 80' 'SCHOOL 106' 'SCHOOL 300' 'SCHOOL 394'
 'SCHOOL 524' 'SCHOOL 176' 'SCHOOL 220' 'SCHOOL 447' 'SCHOOL 265'
 'SCHOOL 527' 'SCHOOL 462' 'SCHOOL 510' 'SCHOOL 533' 'SCHOOL 222'
 'SCHOOL 218' 'SCHOOL 425' 'SCHOOL 441' 'SCHOOL 332' 'SCHOOL 213'
 'SCHOOL 142' 'SCHOOL 160' 'SCHOOL 190' 'SCHOOL 344' 'SCHOOL 495'
 'SCHOOL 493' 'SCHOOL 382' 'SCHOOL 174' 'SCHOOL 385' 'SCHOOL 451'
 'SCHOOL 122' 'SCHOOL 473' 'SCHOOL 161' 'SCHOOL 363' 'SCHOOL 138'
 'SCHOOL 470' 'SCHOOL 497' 'SCHOOL 390' 'SCHOOL 354' 'SCHOOL 359'
 'SCHOOL 335' 'SCHOOL 90' 'SCHOOL 170' 'SCHOOL 251' 'SCHOOL 372'
 'SCHOOL 496' 'SCHOOL 169' 'SCHOOL 324' 'SCHOOL 488' 'SCHOOL 409'
 'SCHOOL 388' 'SCHOOL 215' 'SCHOOL 302' 'SCHOOL 167' 'SCHOOL 357'
 'SCHOOL 415' 'SCHOOL 436' 'SCHOOL 432' 'SCHOOL 231' 'SCHOOL 195'
 'SCHOOL 109' 'SCHOOL 201' 'SCHOOL 413' 'SCHOOL 526' 'SCHOOL 291'
 'SCHOOL 60' 'SCHOOL 101' 'SCHOOL 322' 'SCHOOL 420' 'SCHOOL 499'
 'SCHOOL 228' 'SCHOOL 212' 'SCHOOL 247' 'SCHOOL 476' 'SCHOOL 68'
 'SCHOOL 468' 'SCHOOL 254' 'SCHOOL 438' 'SCHOOL 379' 'SCHOOL 512'
 'SCHOOL 297' 'SCHOOL 481' 'SCHOOL 490' 'SCHOOL 77' 'SCHOOL 113'
 'SCHOOL 461' 'SCHOOL 358' 'SCHOOL 264' 'SCHOOL 517' 'SCHOOL 404'
 'SCHOOL 511' 'SCHOOL 418' 'SCHOOL 530' 'SCHOOL 444' 'SCHOOL 374'
 'SCHOOL 484' 'SCHOOL 214' 'SCHOOL 329' 'SCHOOL 525' 'SCHOOL 194'
 'SCHOOL 498' 'SCHOOL 458' 'SCHOOL 348' 'SCHOOL 509' 'SCHOOL 383'
 'SCHOOL 241' 'SCHOOL 411' 'SCHOOL 124' 'SCHOOL 320' 'SCHOOL 408'
 'SCHOOL 117' 'SCHOOL 206' 'SCHOOL 471' 'SCHOOL 349' 'SCHOOL 248'
 'SCHOOL 479' 'SCHOOL 391' 'SCHOOL 507' 'SCHOOL 105' 'SCHOOL 347'
 'SCHOOL 402' 'SCHOOL 494' 'SCHOOL 147' 'SCHOOL 387' 'SCHOOL 59'
 'SCHOOL 103' 'SCHOOL 430' 'SCHOOL 303' 'SCHOOL 187' 'SCHOOL 433'
 'SCHOOL 94' 'SCHOOL 102' 'SCHOOL 503' 'SCHOOL 172' 'SCHOOL 233'
 'SCHOOL 96' 'SCHOOL 475' 'SCHOOL 299' 'SCHOOL 58' 'SCHOOL 217'
 'SCHOOL 435' 'SCHOOL 380' 'SCHOOL 375' 'SCHOOL 392' 'SCHOOL 165'
 'SCHOOL 421' 'SCHOOL 368' 'SCHOOL 118' 'SCHOOL 439' 'SCHOOL 255'
 'SCHOOL 243' 'SCHOOL 450' 'SCHOOL 211' 'SCHOOL 437' 'SCHOOL 414']
====================================
GROSS_FIN_NEED : [      0  570000  835920  203040 1083000  567000  861720  849660 1165740
   15180  355200 1297440 1285860 1255740 1299360  806760 1360620 1146420
 1192380 1410900 1504980  431700 1346640 1324740 1418100 1333200 1324500
 1206780  699840 1187280  177960  625860  708900  471540 1316580  969180
  909960 1068720  798840 1029600  100440 1042560 1247160 1340880  163620
 1230600 1337460   38280 1171200  380820 1258200  709740  244140  988440
 1010160  842940  522300 1140720 1109460  232980 1145880  769860  706500
  221580  700320  830340  152460 1127160 1199340  819960 1489080  811500
  502140  670920 1177560  640440  154680  700440 1111140  951480  326940
 1117740  921360  982740  121020   37380  848520  858540  445560 1376280
  739500   73860  847440  116460  983160 1254720  696120 1289880  902220
 1167840  490500 1093200  521040 1176420  526740  798960  666900 1165920
  574380  389220  983220  827940  173400 1306020   88920 1098600 1149960
 1392240  626940  268680 1503900  655980  298620  462480  489900  882240
  564660 1272420 1305120  282960  411540  783180 1102500  381240  706200
  776340  664380  568800 1041720  194880  529920  201300  518460  807960
 1263600 1243200 1404600  896100 1091940  585720 1171380  469020 1128120
  924840 1043580  222900  717660 1076160  795240  373320  924000 1228320
 1029900  647100 1199820  625500  575040  306480  336240  794340  842220
  493260  144780  679260  467580 1316700 1255140  937200  740880 1230000
  438540  821580  580440  365940 1001280 1022460  485400 1201500 1138020
 1186860  728040   29820  571260 1017960  488880 1186200  291360 1015680
 1038540  803520 1300620 1245480  542640 1281120  891540 1206480  794400
 1122240  698040 1110720  329040  349860  873060 1158060  499020  717840
  289800 1201260  718200  532740 1176300  118740  593040 1410180 1409460
  626400  392220 1159620  744840 1225500  921540  553320  219240  311160
  809760 1049520   13920 1122120  853800  722640  802860   10020  864900
  196260  460320  898860 1283280  970080 1144200  872760  244320 1005540
  482520  855840 1034820 1096440  357900  774540  145560 1138980  358860
 1173540  353700 1254240  243720  876480  218160 1024800   82500  647160
 1296900  703440  595080  753840  666660  756300  848700  896280  225240
 1308360  217380  668940 1186500  695220  682620  845520  151920 1154640
  910980  780420 1134480  394380 1203120  607380  207660  916680  142980
  917880 1006860 1091040  254880  869760  573960  619440  566040 1329000
 1230900 1411800  897900 1076220  266700  813780  895320 1156200  661740
 1022520 1280820  118260 1131180 1140120  672120  257100  390960  355380
  370320  533100  577500  586440 1340940 1312260 1388100  175500  738300
 1075020  996000  809940  851220 1381320  770700 1040820   86220 1175880
  170640 1250160 1031520  632862  945300  695700  540480  260820 1010640
  572400  961200  872640 1052640  772140  788580  189420  419460 1222680
  461760 1084680  156660  930720  126780  603360  470160 1092720  990600
  198900 1297020 1187580  747180  984000  585120 1056120  855660 1246200
 1058640  407940   67620  516360 1236960  929400   18540  773460 1011660
  634380 1008000 1073220  156480  936180  951780  726900  978600  280140
 1300680  607560 1291440  735840  974040 1053480 1152540  984420  473880
  991380  229200  946560  580380 1151640 1056540 1315800 1410540  672540
 1417380  855960   43620 1190760  701520 1166100 1407840 1063080 1053600
  106380  562800  966300  606120 1357200  931980 1194540  905940  286980
 1241520  996480   35700  123540  727680  282780  533640  832920  825000
  516240 1252500  857580  772440 1257540  240060  298380  722340  559020
 1288860 1248360  447780  632040 1175400  436920   66420  913500  479280
  467340  791940  972360  740100 1056780  907380  816720  738480  228120
   36660  524280  562440 1165560  606840 1183080  301440  423180 1185960
  968460 1338540 1011780  785220 1012320 1416780  772500 1109340 1141800
 1096740  119520 1378080 1559220  682200  167940  141720   86580  576540
  254760 1180020 1325460 1933800  862140  250140 1314420 1230660  819600
 1200600  448140 1285920  162600  365520   89760 1206180  907860 1377780
  161880  516120  816480  745980  756720  703260  547920 1425300 1214880
  689760  184620  493500  479700  446640 1131600  578640 1352160 1130760
  298260  628200  227460 1350060 1233900  774960  237000 1172040  575220
  866820  988560  852060  347040 1197300  989400  520620 1183440 1299540
  895620 1413240  504420  875340 1016880  559620  304200  761760  766560
 1264860  551940  692760 1171800 1349880  845700  800100  864660 1407060
 1131960  596820 1141980  118140  673800  800400 1016760  492360  681480
  654660 1412520  394800 1025220  497640  702600 1354380  935040 1002540
  711720  490440  384900 1087320 1284780  871980  903300  802020 1276140
 1214100  484320  708720  164460 1160400 1111500   47220  189060 1160040
  567060  719340  861060  433740  720060  192180  959280  413460 1260000
 1109400 1038120  133260 1169820  507840  594240  729360  570780  258960
 1334940 1004820 1267380 1086060  875880  867060  840960  728640  771480
 1321020  271440  633480  784980  222540  504000 1341600 1144800  924900
 1073100  738960  161520 1304940  910320 1061040  747900 1009380 1336500
  568260  791880  858660  212280  176580 1396560 1032840  497220  285360
  650640  973320  203400  742800  722520  819720  934560  899640  854580
  478380  882060 1308480 1136220  261600  430380  301080  965040  461400
  625020 1064160  314640  733860  604380  460740  567120 1057860  836880
  953880  866040  846120 1093260 1488000  730740  621420  902760   91740
  598320   47100 1002480   28320  262140  704460 1226400 1011240 1032180
 1310100 1149420 1016040 1102380  484500  751980  945840  654600  454260
 1233420 1140780 1149180  389280 1175100 1189920 1203660  537000 1169700
 1412460   81120  414060 1355760 1165860  945420  856560  141900 1028520
 1140660  811140 1331940 1236840 1282860  266460 1010760  622440 1163220
 1225080  336960  185100 1190520  940500 1491660  582600  479640 1192440
  671700 1006800 1197900  576240  491700   25500  346200 1459020 1199100
  581340  424440  487500 1271820  372360 1145160  597420  114000  340200
  766680 1354740 1461780 1077480  594540  989340  467100  814260 1019220
 1130040  435000  269700  794640  941340  987000 1124940  970800  211140
  569220  440940  630540 1152120  276780   32640  796200 1260840 1257780
 1104060   42420   72840  231060 1017720 1073760 1322700 2124900 1175760
  550260  603660  159840  129540 1299180 1153380  860520  290880 1383900
 1102740  458460  974400  865620  813600   38820  204420 1213320  453660
  812220   25800  239760  339840 1102440 1072200  132780   53340 1135560
 1341360 1079820 1344420  572340  485040  331200 1317900  744420  754260
 1062480  493320  636300  268320    2880  403560  965400 1244580  939780
  645240  631020   22380  250920 1114020 1372560  864480  894060  324300
   14880  899580  233040  829380  565560  828000  355140 1029720  521700
   16860  698580  570180 1292340  891480  909480  665100 1182180 1213980
  798480 1716660  115140   53880 1319640 1316220 1037700  941760 1274580
 1079760 1028040 1141020  759540  491460 1110840  926280  907440  867900
  417900  787140  699300  299820 1009440 1103340 1157700 1049220  635220
  583620  427320  575880 1222080 1288080 1049760  610740  949800  911100
 1143780  444240  619200  866400  648960  689100  552960  831720  115500
 1058580   60180 1724820  123840  536700 1216080  308340  632940  244560]
====================================
COST_OF_ATTEND : [      0 1355760 1264860 1189920 1255740 1324500  835680 1176420  403680
 1151640 1410180 1146420  224640 1504980  199440 1297440  326880 1285860
 1418100  150540  301080  902760 1245480 1306020 1409460  451620 1346640
 1025700 1233900 1417380  336960  216480 1171380 1187280 1160520 1316580
 1201500  389520 1138980  299160  259680 1459020 1296900  949560  945840
 1489080  269100 1466460  134580  915000  983160  108960 1392240  112320
  326640 1010640  240120  939000  933000 1270680 1144200  360180  349860
  392220  339840  217920  357900  831720  860640  848700   99720  955500
 1309560  108240  840540  869760 1393560  861060  908160  852480  208560
 1236960  941700  894060 1071780 2082900 1025160  342720 1559220  369000
  278280  845940 1804860  588120 1474800  976320  733860  315960  352860
  331740  457560  417900  766680 1019760  372600 1019220  970800  120060
  905340 2124900  807420  830100  347580 1119300  392400  472980 1133940
 1432500  833160 1862040]
====================================
EST_FAM_CONTRIBUTION : [      0  785760  519840 ... 2177700 1505400 1288980]
====================================
UNMET_NEED : [      0.  459300.  278340. ...   34560.  -49860. -289140.]
====================================
In [25]:
len(demo4)
Out[25]:
8
In [26]:
# It shows sum of missing values for each variables
cf.isnull().sum()
Out[26]:
STUDENT IDENTIFIER             0
STDNT_AGE                      0
STDNT_GENDER                   0
STDNT_BACKGROUND               0
IN_STATE_FLAG                  0
INTERNATIONAL_STS              0
STDNT_MAJOR                    0
STDNT_MINOR                    0
STDNT_TEST_ENTRANCE1        2294
STDNT_TEST_ENTRANCE2         908
STDNT_TEST_ENTRANCE_COMB     518
FIRST_TERM                     0
CORE_COURSE_NAME_1_F           0
CORE_COURSE_GRADE_1_F          0
CORE_COURSE_NAME_2_F          99
CORE_COURSE_GRADE_2_F         99
CORE_COURSE_NAME_3_F         565
CORE_COURSE_GRADE_3_F        565
CORE_COURSE_NAME_4_F        1597
CORE_COURSE_GRADE_4_F       1597
CORE_COURSE_NAME_5_F        2755
CORE_COURSE_GRADE_5_F       2755
CORE_COURSE_NAME_6_F        3272
CORE_COURSE_GRADE_6_F       3272
SECOND_TERM                    0
CORE_COURSE_NAME_1_S         157
CORE_COURSE_GRADE_1_S        232
CORE_COURSE_NAME_2_S         439
CORE_COURSE_GRADE_2_S        439
CORE_COURSE_NAME_3_S        1038
CORE_COURSE_GRADE_3_S       1038
CORE_COURSE_NAME_4_S        2045
CORE_COURSE_GRADE_4_S       2045
CORE_COURSE_NAME_5_S        2950
CORE_COURSE_GRADE_5_S       2950
CORE_COURSE_NAME_6_S        3319
CORE_COURSE_GRADE_6_S       3319
HOUSING_STS                    0
RETURNED_2ND_YR                0
DISTANCE_FROM_HOME            25
HIGH_SCHL_GPA                 53
HIGH_SCHL_NAME                 1
FATHER_HI_EDU_CD             432
FATHER_HI_EDU_DESC             0
MOTHER_HI_EDU_CD             489
MOTHER_HI_EDU_DESC             0
DEGREE_GROUP_CD                0
DEGREE_GROUP_DESC              0
FIRST_TERM_ATTEMPT_HRS         0
FIRST_TERM_EARNED_HRS          0
SECOND_TERM_ATTEMPT_HRS      206
SECOND_TERM_EARNED_HRS       209
GROSS_FIN_NEED                 0
COST_OF_ATTEND                 0
EST_FAM_CONTRIBUTION           0
UNMET_NEED                     0
dtype: int64
In [27]:
#It is group data with two different variables named under att
att=cf.groupby('RETURNED_2ND_YR')
In [28]:
# This shows the data type of that column
type('RETURNED_2ND_YR')
Out[28]:
str
In [29]:
#it shows mean of all the variables wrt to 0 and 1
att.mean()
Out[29]:
STUDENT IDENTIFIER STDNT_AGE STDNT_TEST_ENTRANCE1 STDNT_TEST_ENTRANCE2 STDNT_TEST_ENTRANCE_COMB FIRST_TERM SECOND_TERM DISTANCE_FROM_HOME HIGH_SCHL_GPA FATHER_HI_EDU_CD MOTHER_HI_EDU_CD FIRST_TERM_ATTEMPT_HRS FIRST_TERM_EARNED_HRS SECOND_TERM_ATTEMPT_HRS SECOND_TERM_EARNED_HRS GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION UNMET_NEED
RETURNED_2ND_YR
0 7.899963e+06 18.015214 20.550000 970.705882 999.403974 200787.391425 200881.391425 112.214586 3.131802 2.527419 2.535000 13.988935 11.809129 13.906250 11.466912 307243.128631 549032.199170 315265.643154 90679.460581
1 7.896397e+06 17.988420 20.690531 963.647830 997.322212 200774.193500 200868.193500 96.506762 3.221768 2.569421 2.580701 13.985058 12.311916 14.363774 12.725727 302365.237206 552209.069854 352636.787449 67235.153455

The describe() method is used for calculating some statistical data like percentile, mean and std of the numerical values of the Series or DataFrame. It analyzes both numeric and object series and also the DataFrame column sets of mixed data types

In [30]:
cf.describe()
Out[30]:
STUDENT IDENTIFIER STDNT_AGE STDNT_TEST_ENTRANCE1 STDNT_TEST_ENTRANCE2 STDNT_TEST_ENTRANCE_COMB FIRST_TERM SECOND_TERM RETURNED_2ND_YR DISTANCE_FROM_HOME HIGH_SCHL_GPA FATHER_HI_EDU_CD MOTHER_HI_EDU_CD FIRST_TERM_ATTEMPT_HRS FIRST_TERM_EARNED_HRS SECOND_TERM_ATTEMPT_HRS SECOND_TERM_EARNED_HRS GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION UNMET_NEED
count 3.400000e+03 3400.000000 1106.000000 2492.000000 2882.000000 3400.000000 3400.000000 3400.000000 3375.000000 3347.000000 2968.000000 2911.000000 3400.000000 3400.000000 3194.000000 3191.000000 3.400000e+03 3.400000e+03 3.400000e+03 3.400000e+03
mean 7.897155e+06 17.994118 20.660036 965.092295 997.758501 200777.000000 200871.000000 0.787353 99.825185 3.202738 2.560647 2.571281 13.985882 12.205000 14.285848 12.511125 3.034025e+05 5.515335e+05 3.446899e+05 7.222052e+04
std 8.083421e+04 0.552052 3.234734 209.088633 156.268221 170.001298 170.001298 0.409240 235.196984 0.455337 0.681273 0.611729 1.554286 2.986294 2.104387 3.470492 4.836255e+05 6.064403e+05 7.871878e+05 2.795679e+05
min 7.755837e+06 16.000000 8.000000 350.000000 530.000000 200508.000000 200602.000000 0.000000 0.000000 0.000000 1.000000 1.000000 9.000000 0.000000 2.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 -1.212072e+06
25% 7.827039e+06 18.000000 18.000000 900.000000 910.000000 200608.000000 200702.000000 1.000000 69.000000 2.870000 2.000000 2.000000 13.000000 11.000000 13.000000 11.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 7.899052e+06 18.000000 20.000000 990.000000 990.000000 200808.000000 200902.000000 1.000000 69.000000 3.200000 3.000000 3.000000 14.000000 13.000000 14.000000 13.000000 0.000000e+00 1.505400e+05 0.000000e+00 0.000000e+00
75% 7.964008e+06 18.000000 22.000000 1100.000000 1110.000000 200908.000000 201002.000000 1.000000 138.000000 3.560000 3.000000 3.000000 15.000000 15.000000 16.000000 15.000000 5.995800e+05 1.192815e+06 3.241800e+05 1.876155e+05
max 8.037098e+06 26.000000 32.000000 1490.000000 1510.000000 201008.000000 201102.000000 1.000000 5932.000000 4.000000 4.000000 4.000000 21.000000 21.000000 23.000000 23.000000 2.124900e+06 2.124900e+06 5.999940e+06 1.632660e+06
In [31]:
# .value counts shows the counts of those variables like It shows number of Female and male students here

cf['STDNT_GENDER'].value_counts()
Out[31]:
F    2140
M    1260
Name: STDNT_GENDER, dtype: int64
In [32]:
#  It is used to group the data to see number of students left and returned
att=cf.groupby('RETURNED_2ND_YR')
att_count=cf.groupby('RETURNED_2ND_YR').count()
In [33]:
# it shows the number of left students under different categories
att_count
Out[33]:
STUDENT IDENTIFIER STDNT_AGE STDNT_GENDER STDNT_BACKGROUND IN_STATE_FLAG INTERNATIONAL_STS STDNT_MAJOR STDNT_MINOR STDNT_TEST_ENTRANCE1 STDNT_TEST_ENTRANCE2 ... DEGREE_GROUP_CD DEGREE_GROUP_DESC FIRST_TERM_ATTEMPT_HRS FIRST_TERM_EARNED_HRS SECOND_TERM_ATTEMPT_HRS SECOND_TERM_EARNED_HRS GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION UNMET_NEED
RETURNED_2ND_YR
0 723 723 723 723 723 723 723 723 240 510 ... 723 723 723 723 544 544 723 723 723 723
1 2677 2677 2677 2677 2677 2677 2677 2677 866 1982 ... 2677 2677 2677 2677 2650 2647 2677 2677 2677 2677

2 rows × 55 columns

In [34]:
# It will plot the bar chart to see how many students left due to distane from home

import matplotlib.pyplot as plt
In [35]:
plt.bar(att_count.index.values,att_count['DISTANCE_FROM_HOME'])
plt.xlabel('Students left College')
plt.ylabel('Number of students')
plt.title ('DISTANCE_FROM_HOME')
plt.show()
In [36]:
# It gives count for number of students left and return
cf.RETURNED_2ND_YR.value_counts()
Out[36]:
1    2677
0     723
Name: RETURNED_2ND_YR, dtype: int64
In [37]:
# Its new data frame to see all the students who left
cf_new=cf.query("RETURNED_2ND_YR=='0'")
In [38]:
# It shows first 10 left students data
cf_new.head(10)
Out[38]:
STUDENT IDENTIFIER STDNT_AGE STDNT_GENDER STDNT_BACKGROUND IN_STATE_FLAG INTERNATIONAL_STS STDNT_MAJOR STDNT_MINOR STDNT_TEST_ENTRANCE1 STDNT_TEST_ENTRANCE2 ... DEGREE_GROUP_CD DEGREE_GROUP_DESC FIRST_TERM_ATTEMPT_HRS FIRST_TERM_EARNED_HRS SECOND_TERM_ATTEMPT_HRS SECOND_TERM_EARNED_HRS GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION UNMET_NEED
0 7808615 18 F BGD 1 Y N Undeclared N NaN 1150.0 ... B Bachelors 16 16 14.0 14.0 0 0 0 0.0
11 7884573 18 F BGD 1 Y N Undeclared N NaN 1050.0 ... B Bachelors 16 16 16.0 16.0 0 1189920 5163240 -326880.0
17 8004193 17 M BGD 1 N N Biology N NaN 1040.0 ... B Bachelors 14 14 14.0 14.0 1083000 1324500 241500 476580.0
23 8007842 18 F BGD 1 Y N Pre-Nursing N NaN 580.0 ... B Bachelors 15 15 15.0 15.0 0 0 0 0.0
36 7799972 18 F BGD 1 Y N Undeclared N NaN 620.0 ... B Bachelors 18 18 14.0 14.0 0 0 0 0.0
45 7911769 18 F BGD 1 Y N Psychology N NaN 1130.0 ... B Bachelors 15 15 15.0 15.0 355200 1176420 821220 56040.0
56 7889645 18 F BGD 1 Y N Early Childhood Education N 26.0 NaN ... B Bachelors 14 10 12.0 12.0 0 0 0 0.0
73 7807583 18 F BGD 6 Y N Undeclared N NaN 1360.0 ... B Bachelors 16 16 15.0 15.0 0 0 0 0.0
84 7935064 18 F BGD 1 Y N Undeclared N 27.0 NaN ... B Bachelors 15 15 16.0 16.0 0 0 0 0.0
85 7805566 17 F BGD 1 Y N Undeclared N NaN 1150.0 ... B Bachelors 13 13 12.0 12.0 0 1264860 1471380 -347580.0

10 rows × 56 columns

In [39]:
# It shows missing values for new data frame for left students
cf_new.isnull().sum()
Out[39]:
STUDENT IDENTIFIER            0
STDNT_AGE                     0
STDNT_GENDER                  0
STDNT_BACKGROUND              0
IN_STATE_FLAG                 0
INTERNATIONAL_STS             0
STDNT_MAJOR                   0
STDNT_MINOR                   0
STDNT_TEST_ENTRANCE1        483
STDNT_TEST_ENTRANCE2        213
STDNT_TEST_ENTRANCE_COMB    119
FIRST_TERM                    0
CORE_COURSE_NAME_1_F          0
CORE_COURSE_GRADE_1_F         0
CORE_COURSE_NAME_2_F         20
CORE_COURSE_GRADE_2_F        20
CORE_COURSE_NAME_3_F        116
CORE_COURSE_GRADE_3_F       116
CORE_COURSE_NAME_4_F        330
CORE_COURSE_GRADE_4_F       330
CORE_COURSE_NAME_5_F        594
CORE_COURSE_GRADE_5_F       594
CORE_COURSE_NAME_6_F        694
CORE_COURSE_GRADE_6_F       694
SECOND_TERM                   0
CORE_COURSE_NAME_1_S        125
CORE_COURSE_GRADE_1_S       189
CORE_COURSE_NAME_2_S        221
CORE_COURSE_GRADE_2_S       221
CORE_COURSE_NAME_3_S        315
CORE_COURSE_GRADE_3_S       315
CORE_COURSE_NAME_4_S        489
CORE_COURSE_GRADE_4_S       489
CORE_COURSE_NAME_5_S        652
CORE_COURSE_GRADE_5_S       652
CORE_COURSE_NAME_6_S        710
CORE_COURSE_GRADE_6_S       710
HOUSING_STS                   0
RETURNED_2ND_YR               0
DISTANCE_FROM_HOME           10
HIGH_SCHL_GPA                15
HIGH_SCHL_NAME                0
FATHER_HI_EDU_CD            103
FATHER_HI_EDU_DESC            0
MOTHER_HI_EDU_CD            123
MOTHER_HI_EDU_DESC            0
DEGREE_GROUP_CD               0
DEGREE_GROUP_DESC             0
FIRST_TERM_ATTEMPT_HRS        0
FIRST_TERM_EARNED_HRS         0
SECOND_TERM_ATTEMPT_HRS     179
SECOND_TERM_EARNED_HRS      179
GROSS_FIN_NEED                0
COST_OF_ATTEND                0
EST_FAM_CONTRIBUTION          0
UNMET_NEED                    0
dtype: int64
In [40]:
# It gives information of what are data types..int , float object
cf_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 723 entries, 0 to 3399
Data columns (total 56 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   STUDENT IDENTIFIER        723 non-null    int64  
 1   STDNT_AGE                 723 non-null    int64  
 2   STDNT_GENDER              723 non-null    object 
 3   STDNT_BACKGROUND          723 non-null    object 
 4   IN_STATE_FLAG             723 non-null    object 
 5   INTERNATIONAL_STS         723 non-null    object 
 6   STDNT_MAJOR               723 non-null    object 
 7   STDNT_MINOR               723 non-null    object 
 8   STDNT_TEST_ENTRANCE1      240 non-null    float64
 9   STDNT_TEST_ENTRANCE2      510 non-null    float64
 10  STDNT_TEST_ENTRANCE_COMB  604 non-null    float64
 11  FIRST_TERM                723 non-null    int64  
 12  CORE_COURSE_NAME_1_F      723 non-null    object 
 13  CORE_COURSE_GRADE_1_F     723 non-null    object 
 14  CORE_COURSE_NAME_2_F      703 non-null    object 
 15  CORE_COURSE_GRADE_2_F     703 non-null    object 
 16  CORE_COURSE_NAME_3_F      607 non-null    object 
 17  CORE_COURSE_GRADE_3_F     607 non-null    object 
 18  CORE_COURSE_NAME_4_F      393 non-null    object 
 19  CORE_COURSE_GRADE_4_F     393 non-null    object 
 20  CORE_COURSE_NAME_5_F      129 non-null    object 
 21  CORE_COURSE_GRADE_5_F     129 non-null    object 
 22  CORE_COURSE_NAME_6_F      29 non-null     object 
 23  CORE_COURSE_GRADE_6_F     29 non-null     object 
 24  SECOND_TERM               723 non-null    int64  
 25  CORE_COURSE_NAME_1_S      598 non-null    object 
 26  CORE_COURSE_GRADE_1_S     534 non-null    object 
 27  CORE_COURSE_NAME_2_S      502 non-null    object 
 28  CORE_COURSE_GRADE_2_S     502 non-null    object 
 29  CORE_COURSE_NAME_3_S      408 non-null    object 
 30  CORE_COURSE_GRADE_3_S     408 non-null    object 
 31  CORE_COURSE_NAME_4_S      234 non-null    object 
 32  CORE_COURSE_GRADE_4_S     234 non-null    object 
 33  CORE_COURSE_NAME_5_S      71 non-null     object 
 34  CORE_COURSE_GRADE_5_S     71 non-null     object 
 35  CORE_COURSE_NAME_6_S      13 non-null     object 
 36  CORE_COURSE_GRADE_6_S     13 non-null     object 
 37  HOUSING_STS               723 non-null    object 
 38  RETURNED_2ND_YR           723 non-null    int64  
 39  DISTANCE_FROM_HOME        713 non-null    float64
 40  HIGH_SCHL_GPA             708 non-null    float64
 41  HIGH_SCHL_NAME            723 non-null    object 
 42  FATHER_HI_EDU_CD          620 non-null    float64
 43  FATHER_HI_EDU_DESC        723 non-null    object 
 44  MOTHER_HI_EDU_CD          600 non-null    float64
 45  MOTHER_HI_EDU_DESC        723 non-null    object 
 46  DEGREE_GROUP_CD           723 non-null    object 
 47  DEGREE_GROUP_DESC         723 non-null    object 
 48  FIRST_TERM_ATTEMPT_HRS    723 non-null    int64  
 49  FIRST_TERM_EARNED_HRS     723 non-null    int64  
 50  SECOND_TERM_ATTEMPT_HRS   544 non-null    float64
 51  SECOND_TERM_EARNED_HRS    544 non-null    float64
 52  GROSS_FIN_NEED            723 non-null    int64  
 53  COST_OF_ATTEND            723 non-null    int64  
 54  EST_FAM_CONTRIBUTION      723 non-null    int64  
 55  UNMET_NEED                723 non-null    float64
dtypes: float64(10), int64(10), object(36)
memory usage: 322.0+ KB
In [41]:
# It will put the missing values in descending order with their percentage for each variable
drp_col=pd.DataFrame(round((cf_new.isnull().sum()/cf_new.shape[0])*100,2).sort_values(ascending=False)).reset_index()
In [42]:
# It will show number of rows and columns for left std data frame
cf_new.shape
Out[42]:
(723, 56)
In [43]:
# It shows the percentage of missing values in descending orders
drp_col
Out[43]:
index 0
0 CORE_COURSE_NAME_6_S 98.20
1 CORE_COURSE_GRADE_6_S 98.20
2 CORE_COURSE_NAME_6_F 95.99
3 CORE_COURSE_GRADE_6_F 95.99
4 CORE_COURSE_NAME_5_S 90.18
5 CORE_COURSE_GRADE_5_S 90.18
6 CORE_COURSE_NAME_5_F 82.16
7 CORE_COURSE_GRADE_5_F 82.16
8 CORE_COURSE_NAME_4_S 67.63
9 CORE_COURSE_GRADE_4_S 67.63
10 STDNT_TEST_ENTRANCE1 66.80
11 CORE_COURSE_NAME_4_F 45.64
12 CORE_COURSE_GRADE_4_F 45.64
13 CORE_COURSE_NAME_3_S 43.57
14 CORE_COURSE_GRADE_3_S 43.57
15 CORE_COURSE_GRADE_2_S 30.57
16 CORE_COURSE_NAME_2_S 30.57
17 STDNT_TEST_ENTRANCE2 29.46
18 CORE_COURSE_GRADE_1_S 26.14
19 SECOND_TERM_EARNED_HRS 24.76
20 SECOND_TERM_ATTEMPT_HRS 24.76
21 CORE_COURSE_NAME_1_S 17.29
22 MOTHER_HI_EDU_CD 17.01
23 STDNT_TEST_ENTRANCE_COMB 16.46
24 CORE_COURSE_NAME_3_F 16.04
25 CORE_COURSE_GRADE_3_F 16.04
26 FATHER_HI_EDU_CD 14.25
27 CORE_COURSE_NAME_2_F 2.77
28 CORE_COURSE_GRADE_2_F 2.77
29 HIGH_SCHL_GPA 2.07
30 DISTANCE_FROM_HOME 1.38
31 FIRST_TERM 0.00
32 CORE_COURSE_GRADE_1_F 0.00
33 STDNT_MINOR 0.00
34 STDNT_BACKGROUND 0.00
35 CORE_COURSE_NAME_1_F 0.00
36 STDNT_GENDER 0.00
37 STDNT_AGE 0.00
38 INTERNATIONAL_STS 0.00
39 STDNT_MAJOR 0.00
40 IN_STATE_FLAG 0.00
41 UNMET_NEED 0.00
42 SECOND_TERM 0.00
43 EST_FAM_CONTRIBUTION 0.00
44 HOUSING_STS 0.00
45 RETURNED_2ND_YR 0.00
46 HIGH_SCHL_NAME 0.00
47 FATHER_HI_EDU_DESC 0.00
48 MOTHER_HI_EDU_DESC 0.00
49 DEGREE_GROUP_CD 0.00
50 DEGREE_GROUP_DESC 0.00
51 FIRST_TERM_ATTEMPT_HRS 0.00
52 FIRST_TERM_EARNED_HRS 0.00
53 GROSS_FIN_NEED 0.00
54 COST_OF_ATTEND 0.00
55 STUDENT IDENTIFIER 0.00
In [44]:
## it will drop the coloumns first 20 rows, as these are having high number of misssing values, which will bias our analysis
c=drp_col.nlargest(19,0)['index'].tolist()
In [45]:
# It will show number of rows and columns
drp_col.shape
Out[45]:
(56, 2)
In [46]:
# It will create new data frame for selected data, here we have dropped columns which are above 25%
drop=drp_col[0:18]

It shows the data that we are not using for analysis We have dropped them as many are categorical & also not directly relevant for the analysis

In [47]:
drop
Out[47]:
index 0
0 CORE_COURSE_NAME_6_S 98.20
1 CORE_COURSE_GRADE_6_S 98.20
2 CORE_COURSE_NAME_6_F 95.99
3 CORE_COURSE_GRADE_6_F 95.99
4 CORE_COURSE_NAME_5_S 90.18
5 CORE_COURSE_GRADE_5_S 90.18
6 CORE_COURSE_NAME_5_F 82.16
7 CORE_COURSE_GRADE_5_F 82.16
8 CORE_COURSE_NAME_4_S 67.63
9 CORE_COURSE_GRADE_4_S 67.63
10 STDNT_TEST_ENTRANCE1 66.80
11 CORE_COURSE_NAME_4_F 45.64
12 CORE_COURSE_GRADE_4_F 45.64
13 CORE_COURSE_NAME_3_S 43.57
14 CORE_COURSE_GRADE_3_S 43.57
15 CORE_COURSE_GRADE_2_S 30.57
16 CORE_COURSE_NAME_2_S 30.57
17 STDNT_TEST_ENTRANCE2 29.46
In [48]:
# new data frame created by name vp for drop by Name
vp=cf_new.drop(drop['index'], axis=1)

This data frame is only for the student who left college

In [49]:
# It shows the number of columns under data frame vp
vp.columns
Out[49]:
Index(['STUDENT IDENTIFIER', 'STDNT_AGE', 'STDNT_GENDER', 'STDNT_BACKGROUND',
       'IN_STATE_FLAG', 'INTERNATIONAL_STS', 'STDNT_MAJOR', 'STDNT_MINOR',
       'STDNT_TEST_ENTRANCE_COMB', 'FIRST_TERM', 'CORE_COURSE_NAME_1_F',
       'CORE_COURSE_GRADE_1_F', 'CORE_COURSE_NAME_2_F',
       'CORE_COURSE_GRADE_2_F', 'CORE_COURSE_NAME_3_F',
       'CORE_COURSE_GRADE_3_F', 'SECOND_TERM', 'CORE_COURSE_NAME_1_S',
       'CORE_COURSE_GRADE_1_S', 'HOUSING_STS', 'RETURNED_2ND_YR',
       'DISTANCE_FROM_HOME', 'HIGH_SCHL_GPA', 'HIGH_SCHL_NAME',
       'FATHER_HI_EDU_CD', 'FATHER_HI_EDU_DESC', 'MOTHER_HI_EDU_CD',
       'MOTHER_HI_EDU_DESC', 'DEGREE_GROUP_CD', 'DEGREE_GROUP_DESC',
       'FIRST_TERM_ATTEMPT_HRS', 'FIRST_TERM_EARNED_HRS',
       'SECOND_TERM_ATTEMPT_HRS', 'SECOND_TERM_EARNED_HRS', 'GROSS_FIN_NEED',
       'COST_OF_ATTEND', 'EST_FAM_CONTRIBUTION', 'UNMET_NEED'],
      dtype='object')
In [50]:
vp.shape
Out[50]:
(723, 38)
In [ ]:
 
In [51]:
# It shows Gender of left students

vp['STDNT_GENDER'].value_counts()
Out[51]:
F    475
M    248
Name: STDNT_GENDER, dtype: int64
In [52]:
#profile report for the Entire Data
import pandas as pd,os,numpy as np,pandas_profiling as pf
pf.ProfileReport(cf)



Out[52]:

In [53]:
# It will show null values for all columns in vp data frame
vp.isnull().sum()
Out[53]:
STUDENT IDENTIFIER            0
STDNT_AGE                     0
STDNT_GENDER                  0
STDNT_BACKGROUND              0
IN_STATE_FLAG                 0
INTERNATIONAL_STS             0
STDNT_MAJOR                   0
STDNT_MINOR                   0
STDNT_TEST_ENTRANCE_COMB    119
FIRST_TERM                    0
CORE_COURSE_NAME_1_F          0
CORE_COURSE_GRADE_1_F         0
CORE_COURSE_NAME_2_F         20
CORE_COURSE_GRADE_2_F        20
CORE_COURSE_NAME_3_F        116
CORE_COURSE_GRADE_3_F       116
SECOND_TERM                   0
CORE_COURSE_NAME_1_S        125
CORE_COURSE_GRADE_1_S       189
HOUSING_STS                   0
RETURNED_2ND_YR               0
DISTANCE_FROM_HOME           10
HIGH_SCHL_GPA                15
HIGH_SCHL_NAME                0
FATHER_HI_EDU_CD            103
FATHER_HI_EDU_DESC            0
MOTHER_HI_EDU_CD            123
MOTHER_HI_EDU_DESC            0
DEGREE_GROUP_CD               0
DEGREE_GROUP_DESC             0
FIRST_TERM_ATTEMPT_HRS        0
FIRST_TERM_EARNED_HRS         0
SECOND_TERM_ATTEMPT_HRS     179
SECOND_TERM_EARNED_HRS      179
GROSS_FIN_NEED                0
COST_OF_ATTEND                0
EST_FAM_CONTRIBUTION          0
UNMET_NEED                    0
dtype: int64

Imputing missing values

In [54]:
# Replace the null values with mean value
vp['STDNT_TEST_ENTRANCE_COMB'].fillna(999,inplace=True)
In [55]:
#vp['CORE_COURSE_NAME_2_F'].fillna('ENG 1101',inplace=True)
vp['CORE_COURSE_NAME_2_F'].fillna(vp['CORE_COURSE_NAME_2_F'].mode().iloc[0],inplace=True)
In [56]:
#vp['CORE_COURSE_GRADE_2_F'].fillna('B',inplace=True)
vp['CORE_COURSE_GRADE_2_F'].fillna(vp['CORE_COURSE_GRADE_2_F'].mode().iloc[0],inplace=True)
In [57]:
#vp['CORE_COURSE_NAME_3_F'].fillna('ENG 1101',inplace=True)
#vp['CORE_COURSE_GRADE_3_F'].fillna('B',inplace=True)
In [58]:
# Replace the null values with mode value
vp['CORE_COURSE_NAME_3_F'].fillna(vp['CORE_COURSE_NAME_3_F'].mode().iloc[0],inplace=True)
vp['CORE_COURSE_GRADE_3_F'].fillna(vp['CORE_COURSE_GRADE_3_F'].mode().iloc[0],inplace=True)
In [59]:
# Replace the null values with mean value 
vp['DISTANCE_FROM_HOME'].fillna(112,inplace=True)
In [60]:
#vp.drop(columns='CORE_COURSE_NAME_1_S',axis=1,inplace=True)
In [61]:
# Replace the null values with mean value
vp['HIGH_SCHL_GPA'].fillna(3,inplace=True)
In [62]:
# It will fill the null values with mean
# vp drop --it will drop the coloumn which is not relevant for our analysis
vp['FATHER_HI_EDU_CD'].fillna(4,inplace=True)
vp.drop(columns=['FATHER_HI_EDU_DESC'],axis=1,inplace=True)
In [63]:
# It will fill the null values with mean
# vp drop --it will drop the coloumn which is not relevant for our analysis
vp['MOTHER_HI_EDU_CD'].fillna(4,inplace=True)
vp.drop(columns=['MOTHER_HI_EDU_DESC'],axis=1,inplace=True)
In [64]:
# vp drop --it will drop the coloumn & we are dropping it as is not required for our analysis
vp.drop(columns='DEGREE_GROUP_DESC',axis=1,inplace=True)
In [65]:
#vp['CORE_COURSE_GRADE_1_F'].replace('INCOMPL','NOT REP',inplace=True)
In [66]:
# It will fill the null values with mean
# vp drop --it will drop the coloumn which is not relevant for our analysis

vp['DISTANCE_FROM_HOME'].fillna(112,inplace=True)
vp.drop(columns=['SECOND_TERM_ATTEMPT_HRS'],axis=1,inplace=True)
In [67]:
vp.isnull().sum()
Out[67]:
STUDENT IDENTIFIER            0
STDNT_AGE                     0
STDNT_GENDER                  0
STDNT_BACKGROUND              0
IN_STATE_FLAG                 0
INTERNATIONAL_STS             0
STDNT_MAJOR                   0
STDNT_MINOR                   0
STDNT_TEST_ENTRANCE_COMB      0
FIRST_TERM                    0
CORE_COURSE_NAME_1_F          0
CORE_COURSE_GRADE_1_F         0
CORE_COURSE_NAME_2_F          0
CORE_COURSE_GRADE_2_F         0
CORE_COURSE_NAME_3_F          0
CORE_COURSE_GRADE_3_F         0
SECOND_TERM                   0
CORE_COURSE_NAME_1_S        125
CORE_COURSE_GRADE_1_S       189
HOUSING_STS                   0
RETURNED_2ND_YR               0
DISTANCE_FROM_HOME            0
HIGH_SCHL_GPA                 0
HIGH_SCHL_NAME                0
FATHER_HI_EDU_CD              0
MOTHER_HI_EDU_CD              0
DEGREE_GROUP_CD               0
FIRST_TERM_ATTEMPT_HRS        0
FIRST_TERM_EARNED_HRS         0
SECOND_TERM_EARNED_HRS      179
GROSS_FIN_NEED                0
COST_OF_ATTEND                0
EST_FAM_CONTRIBUTION          0
UNMET_NEED                    0
dtype: int64
In [68]:
vp.shape
Out[68]:
(723, 34)

Plotting the GRAPHS for all variables for further analysis

We are seeking the analysis for the Students who left after first year causing Attrition

In [69]:
# It shows Bar graph for Number of students left w.r.t Distance from home
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.xlim(min(vp['DISTANCE_FROM_HOME']),200)
plt.hist(vp['DISTANCE_FROM_HOME'],bins='auto')
plt.title('DISTANCE_FROM_HOME')
plt.xlabel('DISTANCE_FROM_HOME')
plt.ylabel('No of student')
plt.show()

Above plot shows maximum number ofnstudents who left college are staying between 60 to 80 Km away from college

In [70]:
# It shows Bar graph for Number of students left w.r.t Distance from home
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.xlim(min(cf['DISTANCE_FROM_HOME']),200)
plt.hist(cf['DISTANCE_FROM_HOME'],bins='auto')
cf[cf["RETURNED_2ND_YR"] == 0][column].hist(color='red',label='Attrited',alpha=1.0)
cf[cf["RETURNED_2ND_YR"] == 1][column].hist(color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.title('DISTANCE_FROM_HOME')
plt.xlabel('DISTANCE_FROM_HOME')
plt.ylabel('No of student')
plt.show()
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:391: RuntimeWarning: invalid value encountered in greater_equal
  keep = (a >= first_edge)
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:392: RuntimeWarning: invalid value encountered in less_equal
  keep &= (a <= last_edge)
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
Above plot shows maximum number of students who left college are staying between 60 to 80 Km away from college
In [71]:
sns.boxplot(data=vp['DISTANCE_FROM_HOME'])
plt.title('DISTANCE_FROM_HOME')
plt.ylabel('No of student')
Out[71]:
Text(0, 0.5, 'No of student')
In [72]:
#check percentiles for some columns
quan=[]
for i in list(range(1,101)):
    quan.append(i/100)
print(vp['DISTANCE_FROM_HOME'].quantile(q=quan).tail())
0.96     235.00
0.97     277.72
0.98     374.96
0.99     838.22
1.00    4501.00
Name: DISTANCE_FROM_HOME, dtype: float64
In [73]:
#Inorder to treat the outliers here & Since less than 1% values are there, we'll drop them for analysis
#raw[[raw["DISTANCE_FROM_HOME"]<=275 | raw["STDNT_AGE"]<=20]]
#raw1=raw.query("DISTANCE_FROM_HOME <= 275 & STDNT_AGE <= 20")
In [74]:
# UNMET_NEED--among students left
plt.xlim(min(vp['UNMET_NEED']),900000)
plt.hist(vp['UNMET_NEED'],bins=10)
plt.title('UNMET_NEED')
plt.xlabel('UNMET_NEED')
plt.ylabel('No of student')
plt.show()
In [75]:
# UNMET_NEED
plt.xlim(min(cf['UNMET_NEED']),900000)
plt.hist(cf['UNMET_NEED'],bins=10)
cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=5,color='red',label='Attrited',alpha=1.0)
cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=5,color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.title('UNMET_NEED')
plt.xlabel('UNMET_NEED')
plt.ylabel('No of student')
plt.show()
In [76]:
sns.boxplot(data=cf['UNMET_NEED'])
plt.title('UNMET_NEED')
plt.ylabel('UNMET_NEED')
Out[76]:
Text(0, 0.5, 'UNMET_NEED')
In [77]:
plt.style.use('ggplot')
plt.xlim(min(vp['EST_FAM_CONTRIBUTION']),800000)
plt.hist(vp['EST_FAM_CONTRIBUTION'],bins='auto')
plt.title('EST_FAM_CONTRIBUTION')
plt.xlabel('EST_FAM_CONTRIBUTION')
plt.ylabel('No of student')
plt.show()
In [78]:
plt.style.use('ggplot')
plt.xlim(min(cf['EST_FAM_CONTRIBUTION']),800000)
plt.hist(vp['EST_FAM_CONTRIBUTION'])
cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=5,color='red',label='Attrited',alpha=1.0)
cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=5,color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.title('EST_FAM_CONTRIBUTION')
plt.xlabel('EST_FAM_CONTRIBUTION')
plt.ylabel('No of student')
plt.show()

Above plot shows Maximum students who left are because of low Family Contribution

In [79]:
sns.boxplot(data=vp['EST_FAM_CONTRIBUTION'])
plt.title('EST_FAM_CONTRIBUTION')
plt.ylabel('EST_FAM_CONTRIBUTION')
Out[79]:
Text(0, 0.5, 'EST_FAM_CONTRIBUTION')
In [80]:
plt.style.use('ggplot')
plt.xlim(min(vp['COST_OF_ATTEND']),900000)
plt.hist(vp['COST_OF_ATTEND'],bins='auto')
plt.title('COST_OF_ATTEND')
plt.xlabel('COST_OF_ATTEND')
plt.ylabel('No of student')
plt.show()
In [81]:
plt.style.use('ggplot')
plt.xlim(min(cf['COST_OF_ATTEND']),900000)
plt.hist(cf['COST_OF_ATTEND'],bins='auto')
cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=5,color='red',label='Attrited',alpha=1.0)
cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=5,color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.title('COST_OF_ATTEND')
plt.xlabel('COST_OF_ATTEND')
plt.ylabel('No of student')
plt.show()
In [82]:
sns.boxplot(data=vp['COST_OF_ATTEND'])
plt.title('COST_OF_ATTEND')
plt.ylabel('COST_OF_ATTEND')
Out[82]:
Text(0, 0.5, 'COST_OF_ATTEND')
In [83]:
plt.style.use('ggplot')
plt.xlim(min(vp['GROSS_FIN_NEED']),900000)
plt.hist(vp['GROSS_FIN_NEED'],bins='auto')
plt.title('GROSS_FIN_NEED')
plt.xlabel('GROSS_FIN_NEED')
plt.ylabel('No of student')
plt.show()
In [84]:
plt.style.use('ggplot')
plt.xlim(min(cf['GROSS_FIN_NEED']),900000)
plt.ylim(min(cf['GROSS_FIN_NEED']),3000)
plt.hist(cf['GROSS_FIN_NEED'],bins='auto')
cf[cf["GROSS_FIN_NEED"] == 0][column].hist(bins=5,color='red',label='Attrited',alpha=1.0)
cf[cf["GROSS_FIN_NEED"] == 1][column].hist(bins=5,color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.title('GROSS_FIN_NEED')
plt.xlabel('GROSS_FIN_NEED')
plt.ylabel('No of student')
plt.show()
In [85]:
sns.boxplot(data=vp['GROSS_FIN_NEED'])
plt.title('GROSS_FIN_NEED')
plt.ylabel('GROSS_FIN_NEED')
Out[85]:
Text(0, 0.5, 'GROSS_FIN_NEED')
In [86]:
plt.hist(cf['FATHER_HI_EDU_CD'])
plt.title('FATHER_HI_EDU_CD')
plt.xlabel('FATHER_HI_EDU_CD')
plt.ylabel('No of student')
cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=5,color='red',label='Attrited',alpha=1.0)
cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=5,color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.show()
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
In [87]:
sns.boxplot(data=cf['FATHER_HI_EDU_CD'])
plt.title('FATHER_HI_EDU_CD')
plt.xlabel('FATHER_HI_EDU_CD')
plt.ylabel('No of student')
Out[87]:
Text(0, 0.5, 'No of student')
In [88]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

x = np.random.randn(100)

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, 
                                    gridspec_kw={"height_ratios": (.15, .85)})

sns.boxplot(data=vp['FATHER_HI_EDU_CD'])
sns.distplot(vp['FATHER_HI_EDU_CD'])

ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
In [89]:
plt.hist(vp['MOTHER_HI_EDU_CD'])
plt.title('MOTHER_HI_EDU_CD')
plt.xlabel('MOTHER_HI_EDU_CD')
plt.ylabel('No of student')
plt.show()
In [90]:
plt.hist(cf['MOTHER_HI_EDU_CD'])
plt.title('MOTHER_HI_EDU_CD')
plt.xlabel('MOTHER_HI_EDU_CD')
plt.ylabel('No of student')
cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=5,color='red',label='Attrited',alpha=1.0)
cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=5,color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.show()
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
In [91]:
sns.boxplot(data=vp['MOTHER_HI_EDU_CD'])
plt.title('MOTHER_HI_EDU_CD')
plt.xlabel('MOTHER_HI_EDU_CD')
plt.ylabel('No of student')
Out[91]:
Text(0, 0.5, 'No of student')
In [92]:
plt.hist(cf['HIGH_SCHL_GPA'])
plt.title('HIGH_SCHL_GPA')
plt.xlabel('GPA score')
plt.ylabel('No of student')
cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=5,color='red',label='Attrited',alpha=1.0)
cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=5,color='orange',label='Retained',alpha=0.5)
plt.legend()
plt.show()
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
D:\Anaconda\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
In [93]:
sns.boxplot(data=cf['HIGH_SCHL_GPA'])
Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f29f9c0f0>
In [94]:
vp.hist(figsize=(20,20))
plt.show()
In [95]:
LL=cf.columns.to_list()
In [96]:
len (LL)
Out[96]:
56
In [97]:
plt.figure(figsize=(100,100))

for i, column in enumerate(LL,1):
    plt.subplot(8,7,i)

    cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=25,color='red',label='Attrited',alpha=1.0)
    cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=25,color='orange',label='Retained',alpha=0.5)
    plt.legend()
    plt.xlabel(column)
In [98]:
plt.figure(figsize=(25,15))

for i, column in enumerate(demo1,1):
    plt.subplot(2,2,i)
    cf[cf["RETURNED_2ND_YR"] == 1][column].value_counts().plot.bar(color='Orange',label='Retained',alpha=0.5)
    cf[cf["RETURNED_2ND_YR"] == 0][column].value_counts().plot.bar(color='red',label='Attrited',alpha=1)
    plt.legend()
    plt.xlabel(column)
In [99]:
plt.figure(figsize=(30, 30))

for i, column in enumerate(demo2, 1):
    plt.subplot(6, 4, i)
    cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=10,color='orange',label='Retained',alpha=0.5)
    cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=10,color='red',label='Attrited',alpha=1.0)
    plt.legend()
    plt.xlabel(column)
In [100]:
plt.figure(figsize=(30, 30))

for i, column in enumerate(demo3, 1):
    plt.subplot(6, 4, i)
    cf[cf["RETURNED_2ND_YR"] == 1][column].hist(bins=10,color='orange',label='Retained',alpha=0.5)
    cf[cf["RETURNED_2ND_YR"] == 0][column].hist(bins=10,color='red',label='Attrited',alpha=1)
    plt.legend()
    plt.xlabel(column)

Now we will see by Pandas profiling the total details of all the variables for Histogram, missing values, & the correlations That will help us to choose & identify the key Drivers It is a simple and fast way to perform exploratory data analysis of a Pandas Dataframe.

CORRELATIONS BETWEEN VARIABLES

In [101]:
col=cf.corr().nlargest(10,"RETURNED_2ND_YR").RETURNED_2ND_YR.index
sns.heatmap(cf[col].corr(), annot=True, cmap="RdYlGn", annot_kws={"size":10})
Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f05b17710>
In [102]:
cf.drop("RETURNED_2ND_YR", axis=1).corrwith(cf.RETURNED_2ND_YR).plot(kind='barh', figsize=(10, 7))
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f2b5015f8>
In [103]:
#SECOND_TERM_EARNED_HRS & SECOND_TERM_ATTEMPT_HRS  & HIGH School GPA are highly correlated

#Second term & First term are negative highly co-related.

#AGE & Student Identifier are co-related.

#Unmet needs & Distance From Home  are negaticely co-related.
In [104]:
cf.isnull().sum()
Out[104]:
STUDENT IDENTIFIER             0
STDNT_AGE                      0
STDNT_GENDER                   0
STDNT_BACKGROUND               0
IN_STATE_FLAG                  0
INTERNATIONAL_STS              0
STDNT_MAJOR                    0
STDNT_MINOR                    0
STDNT_TEST_ENTRANCE1        2294
STDNT_TEST_ENTRANCE2         908
STDNT_TEST_ENTRANCE_COMB     518
FIRST_TERM                     0
CORE_COURSE_NAME_1_F           0
CORE_COURSE_GRADE_1_F          0
CORE_COURSE_NAME_2_F          99
CORE_COURSE_GRADE_2_F         99
CORE_COURSE_NAME_3_F         565
CORE_COURSE_GRADE_3_F        565
CORE_COURSE_NAME_4_F        1597
CORE_COURSE_GRADE_4_F       1597
CORE_COURSE_NAME_5_F        2755
CORE_COURSE_GRADE_5_F       2755
CORE_COURSE_NAME_6_F        3272
CORE_COURSE_GRADE_6_F       3272
SECOND_TERM                    0
CORE_COURSE_NAME_1_S         157
CORE_COURSE_GRADE_1_S        232
CORE_COURSE_NAME_2_S         439
CORE_COURSE_GRADE_2_S        439
CORE_COURSE_NAME_3_S        1038
CORE_COURSE_GRADE_3_S       1038
CORE_COURSE_NAME_4_S        2045
CORE_COURSE_GRADE_4_S       2045
CORE_COURSE_NAME_5_S        2950
CORE_COURSE_GRADE_5_S       2950
CORE_COURSE_NAME_6_S        3319
CORE_COURSE_GRADE_6_S       3319
HOUSING_STS                    0
RETURNED_2ND_YR                0
DISTANCE_FROM_HOME            25
HIGH_SCHL_GPA                 53
HIGH_SCHL_NAME                 1
FATHER_HI_EDU_CD             432
FATHER_HI_EDU_DESC             0
MOTHER_HI_EDU_CD             489
MOTHER_HI_EDU_DESC             0
DEGREE_GROUP_CD                0
DEGREE_GROUP_DESC              0
FIRST_TERM_ATTEMPT_HRS         0
FIRST_TERM_EARNED_HRS          0
SECOND_TERM_ATTEMPT_HRS      206
SECOND_TERM_EARNED_HRS       209
GROSS_FIN_NEED                 0
COST_OF_ATTEND                 0
EST_FAM_CONTRIBUTION           0
UNMET_NEED                     0
dtype: int64
In [105]:
cf.shape
Out[105]:
(3400, 56)
In [106]:
cf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3400 entries, 0 to 3399
Data columns (total 56 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   STUDENT IDENTIFIER        3400 non-null   int64  
 1   STDNT_AGE                 3400 non-null   int64  
 2   STDNT_GENDER              3400 non-null   object 
 3   STDNT_BACKGROUND          3400 non-null   object 
 4   IN_STATE_FLAG             3400 non-null   object 
 5   INTERNATIONAL_STS         3400 non-null   object 
 6   STDNT_MAJOR               3400 non-null   object 
 7   STDNT_MINOR               3400 non-null   object 
 8   STDNT_TEST_ENTRANCE1      1106 non-null   float64
 9   STDNT_TEST_ENTRANCE2      2492 non-null   float64
 10  STDNT_TEST_ENTRANCE_COMB  2882 non-null   float64
 11  FIRST_TERM                3400 non-null   int64  
 12  CORE_COURSE_NAME_1_F      3400 non-null   object 
 13  CORE_COURSE_GRADE_1_F     3400 non-null   object 
 14  CORE_COURSE_NAME_2_F      3301 non-null   object 
 15  CORE_COURSE_GRADE_2_F     3301 non-null   object 
 16  CORE_COURSE_NAME_3_F      2835 non-null   object 
 17  CORE_COURSE_GRADE_3_F     2835 non-null   object 
 18  CORE_COURSE_NAME_4_F      1803 non-null   object 
 19  CORE_COURSE_GRADE_4_F     1803 non-null   object 
 20  CORE_COURSE_NAME_5_F      645 non-null    object 
 21  CORE_COURSE_GRADE_5_F     645 non-null    object 
 22  CORE_COURSE_NAME_6_F      128 non-null    object 
 23  CORE_COURSE_GRADE_6_F     128 non-null    object 
 24  SECOND_TERM               3400 non-null   int64  
 25  CORE_COURSE_NAME_1_S      3243 non-null   object 
 26  CORE_COURSE_GRADE_1_S     3168 non-null   object 
 27  CORE_COURSE_NAME_2_S      2961 non-null   object 
 28  CORE_COURSE_GRADE_2_S     2961 non-null   object 
 29  CORE_COURSE_NAME_3_S      2362 non-null   object 
 30  CORE_COURSE_GRADE_3_S     2362 non-null   object 
 31  CORE_COURSE_NAME_4_S      1355 non-null   object 
 32  CORE_COURSE_GRADE_4_S     1355 non-null   object 
 33  CORE_COURSE_NAME_5_S      450 non-null    object 
 34  CORE_COURSE_GRADE_5_S     450 non-null    object 
 35  CORE_COURSE_NAME_6_S      81 non-null     object 
 36  CORE_COURSE_GRADE_6_S     81 non-null     object 
 37  HOUSING_STS               3400 non-null   object 
 38  RETURNED_2ND_YR           3400 non-null   int64  
 39  DISTANCE_FROM_HOME        3375 non-null   float64
 40  HIGH_SCHL_GPA             3347 non-null   float64
 41  HIGH_SCHL_NAME            3399 non-null   object 
 42  FATHER_HI_EDU_CD          2968 non-null   float64
 43  FATHER_HI_EDU_DESC        3400 non-null   object 
 44  MOTHER_HI_EDU_CD          2911 non-null   float64
 45  MOTHER_HI_EDU_DESC        3400 non-null   object 
 46  DEGREE_GROUP_CD           3400 non-null   object 
 47  DEGREE_GROUP_DESC         3400 non-null   object 
 48  FIRST_TERM_ATTEMPT_HRS    3400 non-null   int64  
 49  FIRST_TERM_EARNED_HRS     3400 non-null   int64  
 50  SECOND_TERM_ATTEMPT_HRS   3194 non-null   float64
 51  SECOND_TERM_EARNED_HRS    3191 non-null   float64
 52  GROSS_FIN_NEED            3400 non-null   int64  
 53  COST_OF_ATTEND            3400 non-null   int64  
 54  EST_FAM_CONTRIBUTION      3400 non-null   int64  
 55  UNMET_NEED                3400 non-null   float64
dtypes: float64(10), int64(10), object(36)
memory usage: 1.5+ MB
In [107]:
cf.describe()
Out[107]:
STUDENT IDENTIFIER STDNT_AGE STDNT_TEST_ENTRANCE1 STDNT_TEST_ENTRANCE2 STDNT_TEST_ENTRANCE_COMB FIRST_TERM SECOND_TERM RETURNED_2ND_YR DISTANCE_FROM_HOME HIGH_SCHL_GPA FATHER_HI_EDU_CD MOTHER_HI_EDU_CD FIRST_TERM_ATTEMPT_HRS FIRST_TERM_EARNED_HRS SECOND_TERM_ATTEMPT_HRS SECOND_TERM_EARNED_HRS GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION UNMET_NEED
count 3.400000e+03 3400.000000 1106.000000 2492.000000 2882.000000 3400.000000 3400.000000 3400.000000 3375.000000 3347.000000 2968.000000 2911.000000 3400.000000 3400.000000 3194.000000 3191.000000 3.400000e+03 3.400000e+03 3.400000e+03 3.400000e+03
mean 7.897155e+06 17.994118 20.660036 965.092295 997.758501 200777.000000 200871.000000 0.787353 99.825185 3.202738 2.560647 2.571281 13.985882 12.205000 14.285848 12.511125 3.034025e+05 5.515335e+05 3.446899e+05 7.222052e+04
std 8.083421e+04 0.552052 3.234734 209.088633 156.268221 170.001298 170.001298 0.409240 235.196984 0.455337 0.681273 0.611729 1.554286 2.986294 2.104387 3.470492 4.836255e+05 6.064403e+05 7.871878e+05 2.795679e+05
min 7.755837e+06 16.000000 8.000000 350.000000 530.000000 200508.000000 200602.000000 0.000000 0.000000 0.000000 1.000000 1.000000 9.000000 0.000000 2.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 -1.212072e+06
25% 7.827039e+06 18.000000 18.000000 900.000000 910.000000 200608.000000 200702.000000 1.000000 69.000000 2.870000 2.000000 2.000000 13.000000 11.000000 13.000000 11.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 7.899052e+06 18.000000 20.000000 990.000000 990.000000 200808.000000 200902.000000 1.000000 69.000000 3.200000 3.000000 3.000000 14.000000 13.000000 14.000000 13.000000 0.000000e+00 1.505400e+05 0.000000e+00 0.000000e+00
75% 7.964008e+06 18.000000 22.000000 1100.000000 1110.000000 200908.000000 201002.000000 1.000000 138.000000 3.560000 3.000000 3.000000 15.000000 15.000000 16.000000 15.000000 5.995800e+05 1.192815e+06 3.241800e+05 1.876155e+05
max 8.037098e+06 26.000000 32.000000 1490.000000 1510.000000 201008.000000 201102.000000 1.000000 5932.000000 4.000000 4.000000 4.000000 21.000000 21.000000 23.000000 23.000000 2.124900e+06 2.124900e+06 5.999940e+06 1.632660e+06
In [108]:
cf.columns.to_series().groupby(cf.dtypes).groups
Out[108]:
{dtype('int64'): Index(['STUDENT IDENTIFIER', 'STDNT_AGE', 'FIRST_TERM', 'SECOND_TERM',
        'RETURNED_2ND_YR', 'FIRST_TERM_ATTEMPT_HRS', 'FIRST_TERM_EARNED_HRS',
        'GROSS_FIN_NEED', 'COST_OF_ATTEND', 'EST_FAM_CONTRIBUTION'],
       dtype='object'),
 dtype('float64'): Index(['STDNT_TEST_ENTRANCE1', 'STDNT_TEST_ENTRANCE2',
        'STDNT_TEST_ENTRANCE_COMB', 'DISTANCE_FROM_HOME', 'HIGH_SCHL_GPA',
        'FATHER_HI_EDU_CD', 'MOTHER_HI_EDU_CD', 'SECOND_TERM_ATTEMPT_HRS',
        'SECOND_TERM_EARNED_HRS', 'UNMET_NEED'],
       dtype='object'),
 dtype('O'): Index(['STDNT_GENDER', 'STDNT_BACKGROUND', 'IN_STATE_FLAG',
        'INTERNATIONAL_STS', 'STDNT_MAJOR', 'STDNT_MINOR',
        'CORE_COURSE_NAME_1_F', 'CORE_COURSE_GRADE_1_F', 'CORE_COURSE_NAME_2_F',
        'CORE_COURSE_GRADE_2_F', 'CORE_COURSE_NAME_3_F',
        'CORE_COURSE_GRADE_3_F', 'CORE_COURSE_NAME_4_F',
        'CORE_COURSE_GRADE_4_F', 'CORE_COURSE_NAME_5_F',
        'CORE_COURSE_GRADE_5_F', 'CORE_COURSE_NAME_6_F',
        'CORE_COURSE_GRADE_6_F', 'CORE_COURSE_NAME_1_S',
        'CORE_COURSE_GRADE_1_S', 'CORE_COURSE_NAME_2_S',
        'CORE_COURSE_GRADE_2_S', 'CORE_COURSE_NAME_3_S',
        'CORE_COURSE_GRADE_3_S', 'CORE_COURSE_NAME_4_S',
        'CORE_COURSE_GRADE_4_S', 'CORE_COURSE_NAME_5_S',
        'CORE_COURSE_GRADE_5_S', 'CORE_COURSE_NAME_6_S',
        'CORE_COURSE_GRADE_6_S', 'HOUSING_STS', 'HIGH_SCHL_NAME',
        'FATHER_HI_EDU_DESC', 'MOTHER_HI_EDU_DESC', 'DEGREE_GROUP_CD',
        'DEGREE_GROUP_DESC'],
       dtype='object')}
In [109]:
vp_new1 = cf[["RETURNED_2ND_YR","GROSS_FIN_NEED","COST_OF_ATTEND","EST_FAM_CONTRIBUTION",
         "DISTANCE_FROM_HOME","UNMET_NEED","STDNT_GENDER","HOUSING_STS","HIGH_SCHL_GPA","STDNT_MAJOR","STDNT_MINOR",'FATHER_HI_EDU_CD', 'MOTHER_HI_EDU_CD','IN_STATE_FLAG','FIRST_TERM_EARNED_HRS','SECOND_TERM_EARNED_HRS',]].copy()

New data frame created with critical factors which are significant for the student who left..And these critical factors are taken from the mean score with respect to Returned second year from histogram analysis, correlation matrix and pandas profiling..

In [110]:
vp_new1.head()
Out[110]:
RETURNED_2ND_YR GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION DISTANCE_FROM_HOME UNMET_NEED STDNT_GENDER HOUSING_STS HIGH_SCHL_GPA STDNT_MAJOR STDNT_MINOR FATHER_HI_EDU_CD MOTHER_HI_EDU_CD IN_STATE_FLAG FIRST_TERM_EARNED_HRS SECOND_TERM_EARNED_HRS
0 0 0 0 0 150.0 0.0 F On Campus 4.00 Undeclared N NaN NaN Y 16 14.0
1 1 570000 1355760 785760 69.0 459300.0 F Off Campus 2.89 Undeclared N 3.0 3.0 N 18 18.0
2 1 0 0 0 150.0 0.0 M On Campus 3.39 Mathematics N 2.0 4.0 Y 15 14.0
3 1 0 0 0 150.0 0.0 M On Campus 2.93 Undeclared N 3.0 3.0 Y 13 14.0
4 1 835920 1355760 519840 69.0 278340.0 F Off Campus 3.86 Art N 3.0 2.0 Y 12 12.0
In [111]:
vp_new1['RETURNED_2ND_YR']=vp_new1['RETURNED_2ND_YR'].map(lambda x: 1 if x== 0 else 0)

This function used to convert 0 to 1 and 1 to 0, as python consider only 1 for modelling... and in our data 0 means not returned to second year

In [112]:
vp_new1.head()
Out[112]:
RETURNED_2ND_YR GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION DISTANCE_FROM_HOME UNMET_NEED STDNT_GENDER HOUSING_STS HIGH_SCHL_GPA STDNT_MAJOR STDNT_MINOR FATHER_HI_EDU_CD MOTHER_HI_EDU_CD IN_STATE_FLAG FIRST_TERM_EARNED_HRS SECOND_TERM_EARNED_HRS
0 1 0 0 0 150.0 0.0 F On Campus 4.00 Undeclared N NaN NaN Y 16 14.0
1 0 570000 1355760 785760 69.0 459300.0 F Off Campus 2.89 Undeclared N 3.0 3.0 N 18 18.0
2 0 0 0 0 150.0 0.0 M On Campus 3.39 Mathematics N 2.0 4.0 Y 15 14.0
3 0 0 0 0 150.0 0.0 M On Campus 2.93 Undeclared N 3.0 3.0 Y 13 14.0
4 0 835920 1355760 519840 69.0 278340.0 F Off Campus 3.86 Art N 3.0 2.0 Y 12 12.0
In [113]:
vp_new1.shape
Out[113]:
(3400, 16)
In [114]:
vp_new1.isnull().sum()
Out[114]:
RETURNED_2ND_YR             0
GROSS_FIN_NEED              0
COST_OF_ATTEND              0
EST_FAM_CONTRIBUTION        0
DISTANCE_FROM_HOME         25
UNMET_NEED                  0
STDNT_GENDER                0
HOUSING_STS                 0
HIGH_SCHL_GPA              53
STDNT_MAJOR                 0
STDNT_MINOR                 0
FATHER_HI_EDU_CD          432
MOTHER_HI_EDU_CD          489
IN_STATE_FLAG               0
FIRST_TERM_EARNED_HRS       0
SECOND_TERM_EARNED_HRS    209
dtype: int64

Machine Learning algorithms can typically only have numerical values as their predictor variables. Hence Label Encoding becomes necessary as they encode categorical labels with numerical values. To avoid introducing feature importance for categorical features with large numbers of unique values, we will use both Lable Encoding and One-Hot Encoding as shown below.

In [115]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create a label encoder object
le = LabelEncoder()
In [116]:
# Label Encoding will be used for columns with 2 or less unique values
le_count = 0
for col in vp_new1.columns[1:]:
    if vp_new1[col].dtype == 'object':
        if len(list(vp_new1[col].unique())) <= 2:
            le.fit(vp_new1[col])
            vp_new1[col] = le.transform(vp_new1[col])
            le_count += 1
print('{} columns were label encoded.'.format(le_count))
3 columns were label encoded.
In [117]:
# convert rest of categorical variable into dummy
vp_new1 = pd.get_dummies(vp_new1, drop_first=True)
In [118]:
vp_new1.head()
Out[118]:
RETURNED_2ND_YR GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION DISTANCE_FROM_HOME UNMET_NEED STDNT_GENDER HOUSING_STS HIGH_SCHL_GPA FATHER_HI_EDU_CD ... STDNT_MINOR_Music STDNT_MINOR_N STDNT_MINOR_Philosophy STDNT_MINOR_Political Science STDNT_MINOR_Professional Writing STDNT_MINOR_Psychology STDNT_MINOR_Sociology STDNT_MINOR_Spanish STDNT_MINOR_Theatre Arts STDNT_MINOR_Women's Studies
0 1 0 0 0 150.0 0.0 0 1 4.00 NaN ... 0 1 0 0 0 0 0 0 0 0
1 0 570000 1355760 785760 69.0 459300.0 0 0 2.89 3.0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 150.0 0.0 1 1 3.39 2.0 ... 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 150.0 0.0 1 1 2.93 3.0 ... 0 1 0 0 0 0 0 0 0 0
4 0 835920 1355760 519840 69.0 278340.0 0 0 3.86 3.0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 103 columns

Here lets check the missing values if any

In [119]:
vp_new1.isnull().sum()
Out[119]:
RETURNED_2ND_YR                 0
GROSS_FIN_NEED                  0
COST_OF_ATTEND                  0
EST_FAM_CONTRIBUTION            0
DISTANCE_FROM_HOME             25
                               ..
STDNT_MINOR_Psychology          0
STDNT_MINOR_Sociology           0
STDNT_MINOR_Spanish             0
STDNT_MINOR_Theatre Arts        0
STDNT_MINOR_Women's Studies     0
Length: 103, dtype: int64

Now Data cleaning by imputing missing values--Treating null values with their mean and mode respectively

In [120]:
vp_new1['HIGH_SCHL_GPA'].isnull().sum()
Out[120]:
53
In [121]:
vp_new1['HIGH_SCHL_GPA'].fillna(3.2,inplace=True)
In [122]:
vp_new1['HIGH_SCHL_GPA'].isnull().sum()
Out[122]:
0
In [123]:
vp_new1['FATHER_HI_EDU_CD'].isnull().sum()
Out[123]:
432
In [124]:
vp_new1['FATHER_HI_EDU_CD'].fillna(2.56,inplace=True)
In [125]:
vp_new1['FATHER_HI_EDU_CD'].isnull().sum()
Out[125]:
0
In [126]:
vp_new1['MOTHER_HI_EDU_CD'].isnull().sum()
Out[126]:
489
In [127]:
vp_new1['MOTHER_HI_EDU_CD'].fillna(2.57,inplace=True)
In [128]:
vp_new1['MOTHER_HI_EDU_CD'].isnull().sum()
Out[128]:
0
In [129]:
vp_new1['COST_OF_ATTEND'].isnull().sum()
Out[129]:
0
In [130]:
vp_new1['DISTANCE_FROM_HOME'].fillna(112,inplace=True)
In [131]:
vp_new1['HIGH_SCHL_GPA'].fillna(3.2,inplace=True)
In [132]:
vp_new1['FIRST_TERM_EARNED_HRS'].isnull().sum()
Out[132]:
0
In [133]:
vp_new1['SECOND_TERM_EARNED_HRS'].isnull().sum()
Out[133]:
209
In [134]:
vp_new1['SECOND_TERM_EARNED_HRS'].fillna(vp_new1['SECOND_TERM_EARNED_HRS'].mean(),inplace=True)
In [135]:
vp_new1['HIGH_SCHL_GPA'].isnull().sum()
Out[135]:
0
In [136]:
vp_new1.isnull().sum()
Out[136]:
RETURNED_2ND_YR                0
GROSS_FIN_NEED                 0
COST_OF_ATTEND                 0
EST_FAM_CONTRIBUTION           0
DISTANCE_FROM_HOME             0
                              ..
STDNT_MINOR_Psychology         0
STDNT_MINOR_Sociology          0
STDNT_MINOR_Spanish            0
STDNT_MINOR_Theatre Arts       0
STDNT_MINOR_Women's Studies    0
Length: 103, dtype: int64
In [137]:
print('Size of Full Encoded Dataset: {}'. format(vp_new1.shape))
Size of Full Encoded Dataset: (3400, 103)

Splitting data into training and testing sets

Prior to implementating or applying any Machine Learning algorithms, we must decouple training and testing datasets from our master dataframe.

The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model's prediction on this subset . Train/test is used to measure the accuracy of model

In [138]:
# assign the target to a new dataframe and convert it to a numerical feature

target = vp_new1['RETURNED_2ND_YR'].copy()
In [139]:
from sklearn.model_selection import train_test_split
In [140]:
vp_new1.head()
Out[140]:
RETURNED_2ND_YR GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION DISTANCE_FROM_HOME UNMET_NEED STDNT_GENDER HOUSING_STS HIGH_SCHL_GPA FATHER_HI_EDU_CD ... STDNT_MINOR_Music STDNT_MINOR_N STDNT_MINOR_Philosophy STDNT_MINOR_Political Science STDNT_MINOR_Professional Writing STDNT_MINOR_Psychology STDNT_MINOR_Sociology STDNT_MINOR_Spanish STDNT_MINOR_Theatre Arts STDNT_MINOR_Women's Studies
0 1 0 0 0 150.0 0.0 0 1 4.00 2.56 ... 0 1 0 0 0 0 0 0 0 0
1 0 570000 1355760 785760 69.0 459300.0 0 0 2.89 3.00 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 150.0 0.0 1 1 3.39 2.00 ... 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 150.0 0.0 1 1 2.93 3.00 ... 0 1 0 0 0 0 0 0 0 0
4 0 835920 1355760 519840 69.0 278340.0 0 0 3.86 3.00 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 103 columns

In [141]:
# let's use stratify=y to maintain the same ratio as in the training dataset when splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(vp_new1,
                                                    target,
                                                    test_size=0.25,
                                                    random_state=7,
                                                    stratify=target)  
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
Number transactions X_train dataset:  (2550, 103)
Number transactions y_train dataset:  (2550,)
Number transactions X_test dataset:  (850, 103)
Number transactions y_test dataset:  (850,)
In [142]:
X_train.head()
Out[142]:
RETURNED_2ND_YR GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION DISTANCE_FROM_HOME UNMET_NEED STDNT_GENDER HOUSING_STS HIGH_SCHL_GPA FATHER_HI_EDU_CD ... STDNT_MINOR_Music STDNT_MINOR_N STDNT_MINOR_Philosophy STDNT_MINOR_Political Science STDNT_MINOR_Professional Writing STDNT_MINOR_Psychology STDNT_MINOR_Sociology STDNT_MINOR_Spanish STDNT_MINOR_Theatre Arts STDNT_MINOR_Women's Studies
1402 0 0 0 0 69.0 0.0 0 1 3.78 2.0 ... 0 1 0 0 0 0 0 0 0 0
2593 0 1146420 1146420 0 90.0 421740.0 0 1 3.48 2.0 ... 0 1 0 0 0 0 0 0 0 0
838 0 0 0 0 69.0 0.0 0 0 2.92 3.0 ... 0 1 0 0 0 0 0 0 0 0
2694 0 0 0 0 69.0 0.0 1 0 2.80 3.0 ... 0 1 0 0 0 0 0 0 0 0
2497 0 0 0 0 69.0 0.0 0 0 2.63 2.0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 103 columns

In [143]:
X=vp_new1.drop("RETURNED_2ND_YR",axis=1)
In [144]:
y=vp_new1["RETURNED_2ND_YR"]
In [145]:
X.head()
Out[145]:
GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION DISTANCE_FROM_HOME UNMET_NEED STDNT_GENDER HOUSING_STS HIGH_SCHL_GPA FATHER_HI_EDU_CD MOTHER_HI_EDU_CD ... STDNT_MINOR_Music STDNT_MINOR_N STDNT_MINOR_Philosophy STDNT_MINOR_Political Science STDNT_MINOR_Professional Writing STDNT_MINOR_Psychology STDNT_MINOR_Sociology STDNT_MINOR_Spanish STDNT_MINOR_Theatre Arts STDNT_MINOR_Women's Studies
0 0 0 0 150.0 0.0 0 1 4.00 2.56 2.57 ... 0 1 0 0 0 0 0 0 0 0
1 570000 1355760 785760 69.0 459300.0 0 0 2.89 3.00 3.00 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 150.0 0.0 1 1 3.39 2.00 4.00 ... 0 1 0 0 0 0 0 0 0 0
3 0 0 0 150.0 0.0 1 1 2.93 3.00 3.00 ... 0 1 0 0 0 0 0 0 0 0
4 835920 1355760 519840 69.0 278340.0 0 0 3.86 3.00 2.00 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 102 columns

In [146]:
y.head()
Out[146]:
0    1
1    0
2    0
3    0
4    0
Name: RETURNED_2ND_YR, dtype: int64
In [147]:
# train and test are used to simulate the entire data..which will be used for building model for prediction
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=7)
In [148]:
X_train.head()
Out[148]:
GROSS_FIN_NEED COST_OF_ATTEND EST_FAM_CONTRIBUTION DISTANCE_FROM_HOME UNMET_NEED STDNT_GENDER HOUSING_STS HIGH_SCHL_GPA FATHER_HI_EDU_CD MOTHER_HI_EDU_CD ... STDNT_MINOR_Music STDNT_MINOR_N STDNT_MINOR_Philosophy STDNT_MINOR_Political Science STDNT_MINOR_Professional Writing STDNT_MINOR_Psychology STDNT_MINOR_Sociology STDNT_MINOR_Spanish STDNT_MINOR_Theatre Arts STDNT_MINOR_Women's Studies
2837 0 269100 0 0.0 -231720.0 0 1 3.64 4.00 2.00 ... 0 1 0 0 0 0 0 0 0 0
2337 1409460 1409460 0 0.0 461760.0 0 0 3.13 2.00 2.00 ... 0 1 0 0 0 0 0 0 0 0
262 0 326880 0 0.0 -356880.0 1 0 3.73 2.56 2.57 ... 0 1 0 0 0 0 0 0 0 0
524 0 0 0 0.0 0.0 0 1 2.61 3.00 2.57 ... 0 1 0 0 0 0 0 0 0 0
864 0 0 0 69.0 0.0 0 0 3.30 2.00 2.00 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 102 columns

Building Machine Learning Models

Baseline Algorithms Let's first use a range of baseline algorithms (using out-of-the-box hyper-parameters) before we move on to more sophisticated solutions.

In [149]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split  
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import xgboost as XG

# Libraries for data modelling
from sklearn import svm, tree, linear_model, neighbors
from sklearn import naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# Common sklearn Model Helpers
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics


# sklearn modules for performance metrics
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve
from sklearn.metrics import auc, roc_auc_score, roc_curve, recall_score, log_loss
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score, make_scorer
from sklearn.metrics import average_precision_score
D:\Anaconda\lib\site-packages\dask\dataframe\utils.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

## The algorithms considered in this section are: Logistic Regression, Random Forest, Decision Tree Classifier, XGboost

In [150]:
# selection of algorithms to consider and set performance measure
models = []
models.append(('Logistic Regression', LogisticRegression(solver='liblinear', random_state=7,
                                                         class_weight='balanced')))
models.append(('Random Forest', RandomForestClassifier(
    n_estimators=100, random_state=7)))
models.append(('Decision Tree Classifier',
               DecisionTreeClassifier(random_state=7)))
models.append(('xgboost',
               XGBClassifier(random_state=7)))
In [151]:
acc_results = []
auc_results = []
names = []
# set table to table to populate with performance results
col = ['Algorithm', 'ROC AUC Mean', 'ROC AUC STD', 
       'Accuracy Mean', 'Accuracy STD']
cf_results = pd.DataFrame(columns=col)
i = 0
# evaluate each model using cross-validation
for name, model in models:
    kfold = model_selection.KFold(
        n_splits=10, random_state=7)  # 10-fold cross-validation

    cv_acc_results = model_selection.cross_val_score(  # accuracy scoring
        model, X_train, y_train, cv=kfold, scoring='accuracy')

    cv_auc_results = model_selection.cross_val_score(  # roc_auc scoring
        model, X_train, y_train, cv=kfold, scoring='roc_auc')

    acc_results.append(cv_acc_results)
    auc_results.append(cv_auc_results)
    names.append(name)
    cf_results.loc[i] = [name,
                         round(cv_auc_results.mean()*100, 2),
                         round(cv_auc_results.std()*100, 2),
                         round(cv_acc_results.mean()*100, 2),
                         round(cv_acc_results.std()*100, 2)
                         ]
    i += 1
cf_results.sort_values(by=['ROC AUC Mean'], ascending=False)
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
Out[151]:
Algorithm ROC AUC Mean ROC AUC STD Accuracy Mean Accuracy STD
3 xgboost 68.95 5.30 80.84 1.85
1 Random Forest 68.43 4.14 81.30 1.76
2 Decision Tree Classifier 61.77 3.26 73.07 2.45
0 Logistic Regression 55.24 4.50 60.00 8.10

Classification Accuracy is the number of correct predictions made as a ratio of all predictions made. It is the most common evaluation metric for classification problems. However, it is often misused as it is only really suitable when there are an equal number of observations in each class and all predictions and prediction errors are equally important. It is not the case in this project, so a different scoring metric may be more suitable.

Using Scikit-Learn’s RandomizedSearchCV method, we can define a grid of hyperparameter ranges, and randomly sample from the grid, performing K-Fold CV with each combination of values.

In [152]:
fig = plt.figure(figsize=(15,7))
fig.suptitle('Algorithm Accuracy Comparison')
ax = fig.add_subplot(111)
plt.boxplot(acc_results)
ax.set_xticklabels(names)
plt.show()

Above box plot for all modes shows the Accuracy comparison between each of them, Random forest and XGBOOST are at higher level, so selecting this 2 model for this case study for further analysis

In [153]:
fig = plt.figure(figsize=(15, 7))
fig.suptitle('Algorithm ROC AUC Comparison')
ax = fig.add_subplot(111)
plt.boxplot(auc_results)
ax.set_xticklabels(names)
plt.show()

Based on our ROC AUC comparison analysis, XGBoost and Random Forest show the highest mean AUC scores. We will shortlist these two algorithms for further analysis.

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. Logistic Regression is classification algorithm that is not as sophisticated as the ensemble methods or boosted decision trees method discussed below. Hence, it provides us with a good benchmark.

Random Forest is a popular and versatile machine learning method that is capable of solving both regression and classification. Random Forest is a brand of Ensemble learning, as it relies on an ensemble of decision trees. It aggregates Classification (or Regression) Trees. A decision tree is composed of a series of decisions that can be used to classify an observation in a dataset.

Random Forest fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting. Random Forest can handle a large number of features, and is helpful for estimating which of your variables are important in the underlying data being modeled

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance that is dominative competitive machine learning.

The two reasons to use XGBoost are also the two goals of the project: Execution Speed. Model Performance.

a) XGBoost Execution Speed Generally, XGBoost is fast. Really fast when compared to other implementations of gradient boosting.

b) XGBoost Model Performance XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems.

In [154]:
from imblearn.over_sampling import SMOTE
from collections import Counter
In [155]:
SS = SMOTE()
X, y = SS.fit_resample(X, y)
In [156]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
print(X.shape)
print(y.shape)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(5354, 102)
(5354,)
(3747, 102)
(3747,)
(1607, 102)
(1607,)
In [157]:
acc_results = []
auc_results = []
names = []
# set table to table to populate with performance results
col = ['Algorithm', 'ROC AUC Mean', 'ROC AUC STD', 
       'Accuracy Mean', 'Accuracy STD']
cf_results = pd.DataFrame(columns=col)
i = 0
# evaluate each model using cross-validation
for name, model in models:
    kfold = model_selection.KFold(
        n_splits=10, random_state=7)  # 10-fold cross-validation

    cv_acc_results = model_selection.cross_val_score(  # accuracy scoring
        model, X_train, y_train, cv=kfold, scoring='accuracy')

    cv_auc_results = model_selection.cross_val_score(  # roc_auc scoring
        model, X_train, y_train, cv=kfold, scoring='roc_auc')

    acc_results.append(cv_acc_results)
    auc_results.append(cv_auc_results)
    names.append(name)
    cf_results.loc[i] = [name,
                         round(cv_auc_results.mean()*100, 2),
                         round(cv_auc_results.std()*100, 2),
                         round(cv_acc_results.mean()*100, 2),
                         round(cv_acc_results.std()*100, 2)
                         ]
    i += 1
cf_results.sort_values(by=['ROC AUC Mean'], ascending=False)
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
Out[157]:
Algorithm ROC AUC Mean ROC AUC STD Accuracy Mean Accuracy STD
1 Random Forest 92.75 1.01 86.50 1.24
3 xgboost 92.20 1.10 86.87 1.73
2 Decision Tree Classifier 79.46 1.14 79.45 1.24
0 Logistic Regression 57.70 2.69 55.11 3.72
In [200]:
fig = plt.figure(figsize=(15,10))
fig.suptitle('Algorithm Accuracy Comparison')
ax = fig.add_subplot(111)
plt.boxplot(acc_results)
ax.set_xticklabels(names)
plt.show()
In [201]:
fig = plt.figure(figsize=(15, 7))
fig.suptitle('Algorithm ROC AUC Comparison')
ax = fig.add_subplot(111)
plt.boxplot(auc_results)
ax.set_xticklabels(names)
plt.show()

Random Forest Classifier

RF is much easier to tune than GBM. There are typically two parameters in RF: number of trees and number of features to be selected at each node. RF is harder to overfit than GBM.

This method is especially attractive for this application in the following cases: The real-world data is noisy and contains many missing values, some of the attributes are categorical, or semi-continuous. There are needs to integrate different data sources which face the issue of weighting them. We need high predictive accuracy for a high-dimensional problem with highly correlated features.

RF methods can handle a large amount of training data efficiently and are inherently suited for multi-class problems. Let's take a closer look at using the Random Forest algorithm. Fine-Tuning the Random Forest algorithm's hyper-parameters by cross-validation against the AUC score.

Let's take a closer look at using the Random Forest algorithm. we shall fine-tune the Random Forest algorithm's hyper-parameters by cross-validation against the AUC score.

As shown above, the results from GridSearchCV provided us with fine-tuned hyper-parameter using ROC_AUC as the scoring metric

Best score --0.92 shows the prediction accuracy for this model. It is achieved by performing Grid Search based on the estimators.

In [158]:
from sklearn.model_selection import GridSearchCV
In [159]:
param_grid = {"n_estimators":[50, 75, 100, 125, 150, 175]}
In [160]:
rf_classifier = RandomForestClassifier(random_state=7,max_features='sqrt',class_weight='balanced_subsample')
In [161]:
grid_obj = GridSearchCV(rf_classifier, param_grid,n_jobs=-1,scoring='roc_auc',cv=10)
grid_obj.fit(X_train, y_train)
Out[161]:
GridSearchCV(cv=10,
             estimator=RandomForestClassifier(class_weight='balanced_subsample',
                                              max_features='sqrt',
                                              random_state=7),
             n_jobs=-1,
             param_grid={'n_estimators': [50, 75, 100, 125, 150, 175]},
             scoring='roc_auc')
In [162]:
grid_obj.best_score_
Out[162]:
0.929605538157505
In [163]:
param_grid1 = {"max_depth":[5, 10, 15, 20, 25]}

grid_obj1 = GridSearchCV(rf_classifier, param_grid1,n_jobs=-1,scoring='roc_auc',cv=10)
grid_obj1.fit(X_train, y_train)
Out[163]:
GridSearchCV(cv=10,
             estimator=RandomForestClassifier(class_weight='balanced_subsample',
                                              max_features='sqrt',
                                              random_state=7),
             n_jobs=-1, param_grid={'max_depth': [5, 10, 15, 20, 25]},
             scoring='roc_auc')
In [164]:
param_grid2 = {"min_samples_split":[2,4,6,8,10]}

grid_obj2 = GridSearchCV(rf_classifier, param_grid2,n_jobs=-1,scoring='roc_auc',cv=10)
grid_obj2.fit(X_train, y_train)
Out[164]:
GridSearchCV(cv=10,
             estimator=RandomForestClassifier(class_weight='balanced_subsample',
                                              max_features='sqrt',
                                              random_state=7),
             n_jobs=-1, param_grid={'min_samples_split': [2, 4, 6, 8, 10]},
             scoring='roc_auc')
In [165]:
param_grid3 = {"min_samples_leaf": [1, 2, 3, 4]}

grid_obj3 = GridSearchCV(rf_classifier, param_grid3,n_jobs=-1,scoring='roc_auc',cv=10)
grid_obj3.fit(X_train, y_train)
Out[165]:
GridSearchCV(cv=10,
             estimator=RandomForestClassifier(class_weight='balanced_subsample',
                                              max_features='sqrt',
                                              random_state=7),
             n_jobs=-1, param_grid={'min_samples_leaf': [1, 2, 3, 4]},
             scoring='roc_auc')
In [166]:
plt.figure(figsize=(20,20))
fig,[[ax,ax1],[ax2,ax3]]=plt.subplots(2,2,figsize=(20,10))
ax3.plot(grid_obj3.cv_results_['mean_test_score']),
ax2.plot(grid_obj2.cv_results_['mean_test_score']),
ax1.plot(grid_obj1.cv_results_['mean_test_score']),
ax.plot(grid_obj.cv_results_['mean_test_score'])

ax.set_xlabel("n_estimators")
ax1.set_xlabel("max_depth")
ax2.set_xlabel("min_samples_split")
ax3.set_xlabel("min_samples_leaf")
Out[166]:
Text(0.5, 0, 'min_samples_leaf')
<Figure size 1440x1440 with 0 Axes>
In [167]:
print(grid_obj.best_params_)

print(grid_obj1.best_params_)

print(grid_obj2.best_params_)

print(grid_obj3.best_params_)
{'n_estimators': 175}
{'max_depth': 25}
{'min_samples_split': 2}
{'min_samples_leaf': 1}
In [168]:
print(grid_obj.best_score_)

print(grid_obj1.best_score_)

print(grid_obj2.best_score_)

print(grid_obj3.best_score_)
0.929605538157505
0.9268503927186572
0.9282778728636852
0.9282778728636852
In [ ]:
 
In [169]:
grid_fit = RandomForestClassifier(random_state=7,max_features='sqrt',class_weight='balanced_subsample',
                                        n_estimators= 175,
                                        max_depth= 25,
                                        min_samples_split= 2,
                                        min_samples_leaf= 1)
                                     
In [170]:
grid_fit.fit(X_train, y_train)
Out[170]:
RandomForestClassifier(class_weight='balanced_subsample', max_depth=25,
                       max_features='sqrt', n_estimators=175, random_state=7)
In [ ]:
 
In [171]:
importances = grid_fit.feature_importances_
indices = np.argsort(importances)[::-1] # Sort feature importances in descending order
#indices = np.argsort(importances)[0:9:-1] # Sort feature importances in descending order
names = [X_train.columns[i] for i in indices] # Rearrange feature names so they match the sorted feature importances
plt.figure(figsize=(40, 7))#.plt(kind='barh)'# Create plot
plt.title("Feature Importance") # Create plot title
plt.bar(range(X_train.shape[1]), importances[indices]) # Add bars
plt.xticks(range(X_train.shape[1]), names, rotation=90) # Add feature names as x-axis labels
plt.show() # Show plot
In [ ]:
 
In [ ]:
 
In [ ]:
 

Random Forest helped us identify the Top 10 most important indicators (ranked in the table below) as: Second term earned hours, first term earned hours, Distance from home, Family contribution, cost of attend, Gross Financial needs Mother high edu CD Father High edu CD Students Gender Housing sts

EVALUATION

In [172]:
## Confusion Matrix
cnf_matrix = metrics.confusion_matrix(y_test, grid_fit.predict(X_test))
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[172]:
Text(0.5, 352.48, 'Predicted label')
In [173]:
# percentage of accuracy score approx=(731+669)/(731+74+133+669)
In [174]:
print('Accuracy of RandomForest Regression Classifier on test set: {:.2f}'.format(grid_fit.score(X_test, y_test)*100))
Accuracy of RandomForest Regression Classifier on test set: 87.43

The Confusion matrix is telling us that we have 730 + 651 correct predictions and 75+152 incorrect predictions. In other words, an accurac of 85.94%.

Confusion Matrix: A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but more importantly the types of errors that are being made.

The confusion matrix is useful for giving you false positives and false negatives. The classification report tells you the accuracy of your model. Classification Rate/Accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN) = (732+673) /(673+ 732+ 73+ 129) = 0.88 Recall: Recall gives us an idea about when it’s actually yes, how often does it predict yes. Recall = TP / (TP + FN) = 673 (673 + 129) =0.839

Precision: Precsion tells us about when it predicts yes, how often is it correct. Precision = TP / (TP + FP)= 673/ (673+73) = 0.90

F-measure: Fmeasure = (2 Recall Precision) / (Recall + Presision) = 2 ( 0.84 0.90) / (0.839 + 0.90) = 2(0.756)/1.74=1.512/1.74=0.87

In [175]:
# Classification report for the optimised RF Regression
grid_fit.fit(X_train, y_train)
print(classification_report(y_test, grid_fit.predict(X_test)))
              precision    recall  f1-score   support

           0       0.85      0.91      0.88       805
           1       0.90      0.84      0.87       802

    accuracy                           0.87      1607
   macro avg       0.88      0.87      0.87      1607
weighted avg       0.88      0.87      0.87      1607

In [176]:
grid_fit.fit(X_train, y_train) # fit optimised model to the training data
probs = grid_fit.predict_proba(X_test) # predict probabilities
probs = probs[:, 1] # we will only keep probabilities associated with the employee leaving
grid_fit_roc_auc = roc_auc_score(y_test, probs) # calculate AUC score using test dataset
print('AUC score: %.3f' % grid_fit_roc_auc)
AUC score: 0.929
In [177]:
fpr, tpr, thresholds = roc_curve(y_test, grid_fit.predict_proba(X_test)[:,1]) 
plt.figure()
plt.plot(fpr, tpr, label='Model (Area = %0.2f)' % grid_fit_roc_auc) 
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.savefig('Random Forest_ROC')
plt.show()

Using probabilistic class prediction resulted in AUC score is: 0.926 which is higher than that best score during the optimisation step.

ROC Graphs AUC - ROC curve is a performance measurement for classification problem at various thresholds settings. ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. The green line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

XGBOOST

In [178]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from operator import itemgetter
import xgboost as xgb
import random
import time
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import average_precision_score
import matplotlib.pyplot as plt
from numpy import genfromtxt
import seaborn as sns
from sklearn import preprocessing
from sklearn.metrics import roc_curve, auc,recall_score,precision_score
import datetime as dt
In [179]:
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
import time
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
In [180]:
steps = [('scaler', StandardScaler()), ('XGBoost', xgb.XGBClassifier())]
In [181]:
pipeline = Pipeline(steps)
In [182]:
parameters = {'XGBoost__booster': ['gbtree', 'gblinear', 'dart'],
              'XGBoost__subsample': np.arange(0.5, 1, 0.05),
              'XGBoost__max_depth': np.arange(3, 20, 1),
              'XGBoost__colsample_bytree': np.arange(0.1, 1.05, 0.05),
              'XGBoost__objective': ['reg:logistic','binary:logistic']}
In [183]:
model = RandomizedSearchCV(pipeline, parameters, n_iter=10, scoring='roc_auc', cv=4)
In [184]:
start_time = time.time()
In [185]:
model.fit(X_train, y_train)
Out[185]:
RandomizedSearchCV(cv=4,
                   estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                             ('XGBoost',
                                              XGBClassifier(base_score=None,
                                                            booster=None,
                                                            colsample_bylevel=None,
                                                            colsample_bynode=None,
                                                            colsample_bytree=None,
                                                            gamma=None,
                                                            gpu_id=None,
                                                            importance_type='gain',
                                                            interaction_constraints=None,
                                                            learning_rate=None,
                                                            max_delta_step=None,
                                                            max_depth=None,
                                                            min_child_weight=None,...
                                                             'dart'],
                                        'XGBoost__colsample_bytree': array([0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 ,
       0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
                                        'XGBoost__max_depth': array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
                                        'XGBoost__objective': ['reg:logistic',
                                                               'binary:logistic'],
                                        'XGBoost__subsample': array([0.5 , 0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95])},
                   scoring='roc_auc')
In [186]:
elapsed_time = (time.time() - start_time)/60
print('Time to tune the model: {0:0.2f} min.'.format(elapsed_time))
Time to tune the model: 1.90 min.
In [187]:
print('Tuned Model Parameters: {}'.format(model.best_params_))
Tuned Model Parameters: {'XGBoost__subsample': 0.9500000000000004, 'XGBoost__objective': 'reg:logistic', 'XGBoost__max_depth': 12, 'XGBoost__colsample_bytree': 0.9500000000000003, 'XGBoost__booster': 'dart'}
In [188]:
predictions = model.predict(X_test)
In [189]:
pred_prob = model.predict_proba(X_test)[:,1]
In [190]:
cm = pd.DataFrame(confusion_matrix(y_test, predictions))
cm['Total'] = np.sum(cm, axis=1)
cm = cm.append(np.sum(cm, axis=0), ignore_index=True)
cm.columns = ['Predicted No', 'Predicted Yes', 'Total']
cm = cm.set_index([['Actual No', 'Actual Yes', 'Total']])
print(cm)
            Predicted No  Predicted Yes  Total
Actual No            730             75    805
Actual Yes           129            673    802
Total                859            748   1607
In [191]:
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.85      0.91      0.88       805
           1       0.90      0.84      0.87       802

    accuracy                           0.87      1607
   macro avg       0.87      0.87      0.87      1607
weighted avg       0.87      0.87      0.87      1607

In [192]:
logit_roc_auc = roc_auc_score(y_test, model.predict(X_test)) 
In [193]:
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1]) 
plt.figure()
plt.plot(fpr, tpr, label='Model (Area = %0.2f)' % logit_roc_auc) 
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.savefig('XGBoost_ROC')
plt.show()

ROC Graphs AUC - ROC curve is a performance measurement for classification problem at various thresholds settings. ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. The green line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner). This is the worst situation. When AUC is approximately 0.5, model has no discrimination capacity to distinguish between positive class and negative class.

In [194]:
# Create ROC Graph
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, grid_fit.predict_proba(X_test)[:,1])
plt.figure(figsize=(14, 6))

# Plot XGboost ROC
plt.plot(fpr, tpr, label='XGBoost (area = %0.2f)' % logit_roc_auc)
# Plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % grid_fit_roc_auc)
# Plot Base Rate ROC
plt.plot([0,1], [0,1],label='Base Rate' 'k--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

In Machine Learning, performance measurement is an essential task. So when it comes to a classification problem, we can count on an AUC - ROC Curve. When we need to check or visualize the performance of the multi - class classification problem, we use AUC (Area Under The Curve) ROC (Receiver Operating Characteristics) curve. It is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics)

Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems. The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: